ABSTRACT

This paper is an ardent plea for simplifying
experimental design and the associated statistics. The emphasis is on
design itself. Traditional designs from simple to complex are
reviewed and the simplest, most basic ways of handling the data are
presented. Design is stressed in such a way that simple statistics
follow. The intactness of inspectional analysis is heavily stressed.
Assessment of experimental outcomes in terms of both consistency and
magnitude measures is considered at length. The necessity of
examining the data from all angles is indicated. The basic role of
design and the secondary role of statistics are discussed at
length. (Author)
ABSTRACT

This paper is a follow-up to one written ten years ago in
which an ardent plea and pitch was made for simplifying experimental
design and the associated statistics. The plea is repeated here long
and loud. The emphasis in the earlier paper was on statistics per se.
Here it shifts to design itself. Traditional designs from simple to
complex are reviewed and the simplest, most basic ways of handling the
data are presented. Design is stressed in such a way that simple
statistics follow. The Virgo Intacta of inspectional analysis is heavily
stressed. Assessment of experimental outcomes in terms of both
consistency and magnitude measures is considered at length. The necessity
of examining the data from all angles is indicated. The basic role of
design and the secondary role of statistics are discussed at length.
QUICK AND DIRTY STATISTICS REVISITED:
THE USES AND ABUSES OF STATISTICAL
ANALYSIS IN BEHAVIORAL RESEARCH

BACKGROUND

About 12 years ago I wrote a paper entitled "Quick and dirty statistics"
(Jenkins, 1955). By that time I had become fed up with textbooks in
"experimental design" that dealt almost entirely with statistics and had
little or nothing to do with design per se. Design is a first order of
business and has its own special problems; statistics are a long second and
are determined by the design layout. This point seems obvious, but maybe it
isn't. In any event, the purpose of the original paper was to provide
researching graduate students with shortcut, rough and ready methods of
treating data so they could spend minimal time on analysis and maximal time
on research - the proper province of behavioral science. The paper was never
published; it was too big and bulky, containing too many tables. Furthermore,
I didn't feel like going through the nitpicking process of publication,
either journal or book. An abbreviated edition of it was issued for hospital
personnel interested in research under the heading "Shortcut techniques in
the treatment of experimental results" (1956).

Another instigator, tying in with the first, was the continuation of
a trend I deprecated in another unpublished paper of some 12 years ago,
entitled "On the worship of large numbers" (Jenkins, 1955). Large numbers are
real, but not divine. If behavioral scientists paid the respect to chance
that they pay to large Ns, the field would be farther advanced and, more
importantly, fewer papers would clutter up the journals. Part of the mystique
(or possibly the psychopathology) of the behavioral scientist is his magical
faith that large Ns will somehow accomplish something. They do: more work
for E.

A third item triggering off this paper was a brochure recently handed
me at a boat show dealing with Tennessee beer tax facts for 1964. It
contains interesting data relating beer consumption to tax rate. The
statement is made: "States that have the highest total tax rate (state, city
and county) on beer generally have the lowest per capita consumption rate".
(It's obvious the pitch is for reduced beer taxes in the state of Tennessee.)
I have a powerful aversion to the word "generally". "Generally" speaking, the
word "generally" is loose, sloppy, vague and misleading.

For these reasons, this paper was written. The original Quick and Dirty 
manuscript was short on words and long on tables; the present one is long on 
words and short on tables. It is not immediately obvious which approach
changes more behavior. 

This paper could have gone under the guise of several other titles:
"Statistics and other minor methodological matters"; "Why mess with
complicated statistics when simple ones will do?"; "Large numbers really
don't make that much difference"; "Statistics in proper perspective"; "There
is no magic in statistics or large numbers"; "Statistics made simple"; "How
not to analyze data"; "Mistakes we make in treating experimental results";
"The making and breaking of the statistical habit"; "Statistics are real,
but not divine"; "How to read data"; "Statistics the easy way"; "The
complete guide to understanding numbers"; and so forth.

LAYOUT

There are two basic types of set-ups to determine whether covariation
exists between a stimulus dimension and some measure of behavior. The first
is the classical experimental one in which variables are manipulated and
functional relationships emerge or not as the case may be. The other is
correlational, in which measurement (but not manipulation) of two variables
is taken and the intensity of relationship or association between them
determined. The present write-up will consider both.

There are two additional aspects to most data that require consideration.
The effect of an experimental treatment can be a "whopper", i.e., large
differences in magnitude among the several conditions. Or it can be
consistent, with every S or pair of Ss showing the impact of the treatment.
The two indices can be independent, e.g., small magnitude, but high
consistency, but in the limiting case they converge, e.g., when magnitude is
large, consistency is high. Investigators should always consider both these
aspects of their data. Both will be examined in the present context.

To facilitate communication, it might be helpful to spell out the types 
of experimental designs to be considered in later sections of this paper - 
not necessarily in the order given below. The design obviously fixes the 
limits of the class of statistical analysis to be applied after the data are 
in; the nature of the data, convenience and personal preference determine 
what specific class members will be employed. The breakdown of the designs 
follows. It includes most of those commonly used. 



A. One Dimension of Experimental Variation

   1. Two groups: independent or randomly assigned groups,
      matched groups; matched pairs or self-control.

   2. More than two groups: single classification analysis of
      variance (anova).

B. Two "Simultaneous" Dimensions of Experimental Variation: The
   Effects of Two Variables and Their Interaction

   1. Anova for correlated data: matched trios or self-control.

   2. Repeated measurements: independent groups treated across
      blocks of trials or time.

   3. "Simple" Analysis of Covariance (Ancova): partialling out
      pre-treatment differences from treatment measures.

   4. "Simple" Factorial Design: two experimental treatments
      applied "simultaneously".

C. More Than Two "Simultaneous" Dimensions of Variation

   1. Complex Anova: three or more variables and their
      interactions.

   2. Complex Ancova: correcting differences in treatment measures
      for differences in two or more initial pre-treatment indices.

WHAT'S WRONG WITH TRADITIONAL STATISTICS?

There are many things wrong with traditional statistics. For one thing
they take too long. For another they're difficult to communicate. But the
main thing wrong with them is that they lose sight of the behavior of
organisms. Statistics are tools to help simplify and clarify behavioral
measurement. If they do less than this - and they frequently do - they
detract from rather than contribute to the detection of behavioral
principles. Tables of sums of squares, degrees of freedom, interactions and
the like are dandy and elegant, but they tell nothing about the behavior of
individual organisms. As a matter of fact they obscure and confound it. So
why use them? Without going psychoanalytic, psychologists seem to possess
some blind faith that fancy statistical analysis will produce an emergent
from the data, will refine and go beyond them. This is clear nonsense. The
fault is not really in the statistics, but in the design and most probably in
the problem selected and particularly the corner into which investigators
paint themselves by their selection of experimental treatments and behavioral
measurements. Be that as it may, it seems to be a case of "Please don't eat
the statistical daisies".

There is another way. Problems can be selected and experiments designed
so that simple enumerative statistics can be employed. Count statistics are
what count - in more ways than one. With small Ns a quick look-see will
immediately reveal how many Experimental cases exceed the highest or average
Control case. It's a matter of how to analyze data without really trying -
or at least without really working at it. Most numbers are simple, but they
can be made complicated and even incomprehensible by appropriate statistical
manipulation. There's an old Balkan saying: "There are a thousand doors to
let out life, but very few to let it in". Similarly, there are many ways of
cutting, slicing and working over data, but few of them carry the message of
clarifying and simplifying the original numbers representing the behavior of
individual organisms.



VIRGO INTACTA: INSPECTIONAL ANALYSIS

While the phraseology may be redundant, "intact virgin" is an accurate
description of the state of the art in looking at behavioral results. Of
course, it's true that if one is attempting to relate 10 "personality"
measures to 10 "perceptual" ones simultaneously, it's not easy to scan the
data to see what's going on. Ignoring the limit and considering the
straightforward instance, the first step in any treatment of data is visual
scanning, a look-see inspection, to determine what the naked eye can find.
(Visual "sequential" or "trend" analyses a la Skinner's cumulative recordings
are, of course, highly desirable.) If this procedure yields little return,
then it seems unlikely that any amount of complicated statistical torturing
of the data will help. Besides, negative findings are real and basic. To show
a variable has little behavioral impact over a wide range may be more
important in many instances than teasing out a large-N difference barely
attaining the 5% level of significance. Enormous time and effort can be saved
by the simple device of inspectional analysis. If half the Ss produce
increased behavior and the other half decreased, why analyze further? Or if
half a set of correlations or Chi Squares or any other index are positive
and the other half negative, isn't this chance finding meaningful in itself?
Again, if means differ by a couple of points and ranges amount to several
hundred, there is no statistical way of squeezing anything from the data.
More importantly, there is no reason to analyze. The numbers descriptive of
the behavioral events that occurred stand on their own little feet.

One point that is puzzling in this connection is why drawing conclusions
about experimental agreement with chance isn't worth doing. A real chance
finding is, by definition, a rare event and calls for considerable comment.
I have been shown a set of 100 Chi Squares, half positive and half negative
(disregarding magnitude), and have been most impressed with chance while
the exhibitor of the numbers strongly desired to make something of the
handful of large values indicating a positive relationship. Chance is real,
but hard to come by. When it occurs in pure form it surely warrants comment.

For purposes of dialogue I am oversimplifying to some extent, but not
overly. In many instances a quick and dirty check of the data answers the
question asked. On other occasions, of course, manipulation must be resorted
to - minor in a number of cases. Below are given a small set of numbers that
superficially resembles nothing more than a hodgepodge:



  X     Y

 27    61
 17    73
 30    81
 19    52
 20    56
 35    73
 13    76



It is by no means obvious what has happened in these numbers. Arranging
the X column in order of magnitude or, even better, plotting both sets
of numbers graphically, immediately clears the air. From a graphical
representation it is immediately obvious that a curvilinear, U-shaped
relationship has emerged with high and low values of X going with high
values of Y and middle values of X going with low values of Y. Thus as X
increases, values on Y first decrease, then increase - a straightforward
proposition. A few advanced graduate students have failed this type of item
on their doctoral written examinations because they failed to see the
obvious. Experimentally it has been demonstrated that the same information is
communicated many times more rapidly and accurately in graphical than tabular
form. If a quick visual check does not provide the immediate answer as to
what's happened, transformation of numbers to graphical representation will.
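
The rearrangement just described can be sketched in a few lines of Python. This is only an illustration of the procedure, not part of the original analysis; the variable names and the crude text "graph" are assumptions, and the seven pairs are those read from the X and Y columns above.

```python
# The seven (X, Y) pairs from the text, in the order given.
pairs = [(27, 61), (17, 73), (30, 81), (19, 52), (20, 56), (35, 73), (13, 76)]

# Arrange by X so the trend in Y can be read straight down the column.
by_x = sorted(pairs)

for x, y in by_x:
    # A crude text "graph" (illustrative choice): one mark per 4 units of Y.
    print(f"{x:>3} {y:>3} " + "*" * (y // 4))

ys = [y for _, y in by_x]
low_point = ys.index(min(ys))  # position of the smallest Y once ordered by X
```

Once ordered by X, the Y column falls and then rises, with its low point in the interior rather than at either end, which is all the U-shape amounts to.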

How many behavioral scientists visually cut and slice their data before
feeding it into some sort of machine? Many apparently do not look. The
aversion toward numbers stamped in by grammar school harridans teaching
arithmetic may well generalize. The safe way is to let the machine do it. But
the machine knows nothing of the flaws and foibles of behavior - other than
those of its programmer who feeds it. This is neither a plea for nitpicking
nor an anti-machine polemic - although there is a place under the sun
for both. It is an appeal to behavioral investigators to so select their
problems and design their experiments that they can get immediate feedback
from the behavioral data, i.e., see immediately what, if anything, happened.

REPLICATION 

Psychologists who are supposedly statistically sophisticated exhibit
a surprising naivete about chance. In the limiting case a coin will stand
on edge if one flips it enough times. Short of that but still extreme, we
all are aware that one S's response on one occasion does not make a
behavioral phenomenon - unless it happens to be a record mile run or pole
vault. What we fail to recognise is that chance is real and five times in 100
will produce findings significant at the 5% level. The only antidote to
chance is replication. Only if the same direction of effect holds up on two
or (preferably) more occasions can we start to buy the phenomenon, i.e., bet
heavily that the same direction will turn up on the next experimental
occasion. Chance will on a very few, fortunately rare, occasions produce an
inverted generalization decrement function or greater resistance to
extinction after 100% rather than partial reinforcement. What we are betting
on, however, is the bulk of the instances, the preponderance of the evidence.
"Replication" with variation adds generality to the effect and relieves
boredom for E. If one wishes to maximize chance, don't replicate and draw
conclusions; if one wishes to minimize chance, replicate before drawing
conclusions so that "data drift" is forestalled or at least uncovered.
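
The "five times in 100" point can be put in numbers with a rough sketch. The function name is hypothetical, and the calculation assumes that every test is independent and that no real effect exists, which is a simplification adopted here only for illustration.

```python
def chance_hits(n_tests, alpha=0.05):
    """Expected number of 'significant' results, and the probability of at
    least one, when no real effect exists and the tests are independent
    (independence is an assumption made here for simplicity)."""
    expected = n_tests * alpha
    p_at_least_one = 1 - (1 - alpha) ** n_tests
    return expected, p_at_least_one

for n in (1, 20, 100):
    expected, p_any = chance_hits(n)
    print(f"{n:>3} tests: expect {expected:.1f} chance hits, "
          f"P(at least one) = {p_any:.3f}")
```

Under these assumptions, an unreplicated search through 100 indices is all but certain to turn up something "significant"; replication is what separates such hits from a real effect.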

BACKSLIDING, DATA DRIFT, REGRESSION AND CHANCE

One classic example of backsliding is an investigation during W.W. II
where a number of physiological measures were applied to a small sample of
pilot trainees. One hundred measures were used on 20 pilots and correlations
were computed against the criterion of pass-fail in flight training.
By judicious selection, the investigators were able to cull out three
measures (of the 100) that generated a multiple correlation of about .98.
They drew sweeping conclusions. They were asked, of course, to replicate and
they did, reporting another multiple R of .97. It was, of course, based on
three entirely different measures from those of the first study. When
the original three measures were employed with the second sample, the
multiple correlation naturally became .00. Clearly, their measures were
useless in this context. This is a beautiful case of the operation of
chance. The investigators so stacked the chance cards against themselves
that they couldn't possibly win. One should give chance a chance, but
not maximize its operation.

Likewise, as a tour de force, I once analyzed the boxscores of 10
baseball games to determine the relationship between winning or losing and
number of players employed on the expectation that the losing team throws
in more players. The first time I did it the Phi Coefficient came out
around .90. This looked too good so I took 10 consecutive sets of 10 ball
games each and applied the same procedure. The resulting Phis were: .50,
.30, .50, .61, .40, .31, .30, .20, and .73. The average of these is a
shade above .40, considerably less than half of the original correlation.
Again, instances could be multiplied, but the point is clear: chance is
real.
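
As a quick check on the arithmetic, the listed replication values can be averaged directly. This is a minimal sketch; the variable names are assumptions and the values are those read from the list in the text.

```python
from statistics import mean

first_phi = 0.90
# The nine replication values as read from the list in the text.
replications = [0.50, 0.30, 0.50, 0.61, 0.40, 0.31, 0.30, 0.20, 0.73]

avg = mean(replications)
print(f"first run: {first_phi:.2f}, replication mean: {avg:.2f}, "
      f"range: {min(replications):.2f} to {max(replications):.2f}")
```

None of the replications comes near the first value, and their mean sits below half of it.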

Again, in another context I have related the amount of money raised
to the number of reported cases of various diseases and disorders such as
cancer and polio for the year 1958. The numbers are confusing, but
fascinating. Correlational procedures applied to them yield a Phi of -.25
based on a cut-off at the means, one of plus .20 with a cut-off at the
medians and a rank order Rho of -.43. Further sets of figures and replication
are clearly needed.

I was once presented with a mass of t-ratios relating personality
measures to perceptual test performance. Overall, there were 93 positive and
61 negative. For male Ss, 4[?] were positive and 46 negative. The
investigator wanted to draw conclusions regarding the largest positive values
and was quite disappointed when the action of chance was indicated.

A crisp example of the misleading nature of certain relationships
reflected in correlations is the figure of .97 reported by Locke (1961)
between the number of letters in 29 Ss' last names and certain adjectives
descriptive of "personality". It was based on a thorough item analysis and
item selection, leaving the door open, of course, as Locke planned, to
backsliding. Ss with longer last names were gay and impulsive, talented and
God-fearing, did not smoke or used filters, had more dental fillings, liked
vodka and had hair of a different color from their fathers'. The reliability
of the final list of adjectives, incidentally, was only .67. After maximizing
the possibility for regression, Locke found, on cross-validation with an N of
30, an overall correlation of -.80. The initial findings were obviously
attributable to maximising the role of chance.

Then there is the matter of extrapolation. Many popular writers have
paid lip service, with due cause, to the dangers inherent in statistical
analysis. A book has even been published, entitled "How to Lie with
Statistics". One article on this matter had the following section headings:
the unspecified average, the biased sample, the improbably precise figure,
correlations, the gee-whiz graphs, and semantic tricks.

These are all gimmicks and correct as far as they go. Numbers are
slippery things. One has to study them, not take someone else's word for
what they add up to. One does not believe graphs that show the 1500 meter
Olympic run will clock no time at all in the year 2,?50, the American ski
jumping distance record to be one mile in the year 2153, nor (possibly
more plausibly) the Indianapolis Speedway record being 1000 m.p.h. in the
year 2397. Another case in point is the height/waistline ratios of Miss
America winners over the past 40 years. Weintraub and Eisenberg (1966)
have pointed out regarding extrapolation of these figures: "It is obvious
that the height/waistline ratio cannot be a linear function of
time; women were not wider than they were tall several hundred years ago"
(p. 247). Plausibility is one thing; gullibility another.

THE SMALLER THE N THE BETTER

Standard textbooks on statistics (some incorrectly titled "Experimental
Design") pontificate the case for massive sampling. Their argument
seems to be that the effects of chance are somehow diluted or erased
by the magic of masses of information. In the first place, faulty
experimental design in the way of failure to control a variable simply
multiplies itself with increasing N. Secondly, and more importantly, why
should chance operate to a greater extent with small Ns? Chance is not a God
peering over E's shoulder saying "I'll make this case deviant, that one
average." In a lottery the laws of chance are indifferent to the name of
the winner. Chance doesn't work this way. Furthermore, the overwhelming
point is that statistical results "significant" on a small number of cases
add up to a lot more behaviorally than the same finding with a large sample.
A probability of .05 derived from two samples of three cases each means
non-overlapping behavioral measures. The same probability accruing to two
samples of 300 cases each means little except grossly overlapping
distributions with no possibility of individual prediction of behavior. These
points do not deny that a large sample fills in the picture of the Universe
to a greater extent, but after all the primary focus is on behavioral
not statistical principles.
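
The claim about two samples of three can be checked by brute-force counting. The sketch below frames it as a rank (randomization) argument, which is an assumption adopted for illustration rather than a method named in the text: if group labels meant nothing, every way of dealing six untied scores into two groups of three would be equally likely. The function name is hypothetical.

```python
from itertools import combinations
from fractions import Fraction

def separation_probability(n_a, n_b):
    """Chance that all n_a scores in one named group outrank all n_b scores
    in the other, if group labels were assigned at random (no ties)."""
    ranks = range(n_a + n_b)
    splits = list(combinations(ranks, n_a))   # every way to pick group A's ranks
    top = set(range(n_b, n_a + n_b))          # the n_a highest ranks
    hits = sum(1 for s in splits if set(s) == top)
    return Fraction(hits, len(splits))

print(separation_probability(3, 3))   # 1/20
```

Only one of the 20 possible splits gives complete separation in the named direction, so on this basis nothing short of non-overlapping scores can reach the .05 level with three cases per group.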

An equally powerful case can be made on the other side of the coin 
for large samples, large samples, that is, of the behavior of individual 
organisms. A large sample of behavior from a small sample of Ss coupled 
with a big impact of the experimental treatment is the American psycholo- 
gist's dream. Covering a wide range of values of the experimental treat- 
ment along with careful selection of a behavioral measure sensitive to 
the treatment, repeated measurement and replication will head the
investigation in the right direction.

THE CASE OF DEVIANT CASES

By dint of studying individual behavior we must be concerned with
those cases that fall outside acceptable limits. From a statistical
standpoint these stragglers or outliers are a problem; there are dozens
of statistical gimmicks and procedures for excluding them from the final
analysis. None of them is behaviorally satisfactory, however. The
investigator, after throwing out such a case, is always left with the gnawing
doubt that he has overlooked some angle or other. The problem is particularly
pressing when N is small, say three or four cases. The present
viewpoint is that these deviant cases may be more important than the
non-deviant ones. What stimulus circumstances produced the unusual behavior?
The matter hinges in part on the definition of the word "deviant". Here
it is taken to mean unusual, infrequent and rare rather than the
abnormal implied by its common usage. Being elected President of the
U.S. is infrequent, but would hardly be considered a piece of abnormal
behavior in the clinical sense.

The present view is that these unusual cases, particularly in small 
N studies, should be subjected to careful experimental scrutiny for their 
own sake. In them may lie the answers to a number of pressing experimental
problems. In a similar vein one might wish to investigate the background 
and current status of the greatest acrobat or pianist in the world. They 
are certainly deviant in a frequency sense; they occur most rarely. 

NOSE COUNTING AND THE BINOMIAL EXPANSION

Probably the simplest and most efficient analytical tool available
is the binomial expansion. It can be used any time the design calls for
a chance baseline, but usually is used in the 50-50 case. For instance,
if we simply wish to know whether learning occurs under a given set of
operations, all we need do is count the number of Ss that show the
increase in response strength classed as "learning". If five of five Ss
respond more frequently after we've applied an experimental treatment, the
odds are 1 in 32 against "pure" chance generating our event. The binomial
is comprehensible to the layman and has been easily taught to
eight-year-olds.
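
For readers who prefer to see the count written out, here is a minimal sketch of the 50-50 binomial tail. The function name is hypothetical; the calculation is the ordinary binomial expansion with exact fractions.

```python
from fractions import Fraction
from math import comb

def binomial_tail(k, n):
    """Probability of k or more events out of n going in one named direction
    when each direction is equally likely (the 50-50 case)."""
    favorable = sum(comb(n, i) for i in range(k, n + 1))
    return Fraction(favorable, 2 ** n)

print(binomial_tail(5, 5))    # 1/32
print(binomial_tail(4, 5))    # 3/16
```

Five of five gives the 1 in 32 quoted above; a single exception among five already raises the figure to 6 in 32.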

The binomial is simple and obvious. Anyone can understand it. The
rub comes in knowing when to apply it. I have seen a number of instances
where investigators have the perfect set-up for this kind of count statistic
and proceed to fall away into complex analyses that tell them far less
about what has happened than the binomial would - and take many times
as long to apply. Investigators should be primed, and even design their
experiments, so that such simple analytical tools as the binomial can be
applied in just one small extension beyond inspecting the behavioral
outcome of the investigation.

The case so far has been kept to its simplest form. Where there 
are reversals, e.g., one event in the seven goes in the opposite direction 
from the other six, simple tables are available in several standard text- 
books and detailed tables for small Ns are reproduced in the original 
Quick and Dirty manuscript. 

For illustrative purposes there are reproduced in Table 1 some real-life
data deriving from an investigation of skid-row alcoholics. The numbers
represent University of Tennessee Deprivation Scale scores which reflect
the presence or absence of environmental support from family, friends,
job, etc. Individuals scoring high on a drinking scale (see Pascal and
Jenkins, 1961) were matched on age, sex, vocation and education with
individuals scoring low on this scale. They were then compared on
environmental deprivation.

Without any analysis, it is eye-catching and immediately obvious that
each alcoholic score is considerably higher than that of his control partner
on the Deprivation Scale. As a matter of fact the two distributions do
not overlap. Thus 10 in 10 events go in the same direction and the odds
of a chance finding are 1/1024 or P of about .001. Nothing could be simpler
and no further analysis is needed. It should be noted that this analysis
and all of its kind focuses on consistency without regard to magnitude.
In this instance, consistency is the main point and magnitude
is of little consequence. One might devote considerable effort to
applying a matched-pair t-test to these data, but the outcome would remain
the same. This is not saying the effect isn't large; mean differences
are of the order of three to one and the two distributions do not overlap -
a rare "whopper" finding.

Table 1

Alcoholic and control Ss matched by pairs on age, sex, vocation and
education, compared on University of Tennessee Deprivation Scale
scores.

(Pascal and Jenkins, 1960)

PAIR    ALCOHOLIC    CONTROL

  1        10           5
  2        12           6
  3        12           2
  4        10           2
  5        14           2
  6        13           4
  7        12           2
  8        14           3
  9        12           4
 10         8           4

Mean      11.7         3.4

P = 1/1024, about .001
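
Both sides of the Table 1 outcome, consistency and magnitude, can be read off with a few lines of Python. This is a sketch for illustration only; the variable names are assumptions, and the alcoholic score for Pair 10 is taken as 8, the value implied by the reported mean of 11.7.

```python
from fractions import Fraction
from statistics import mean

# Table 1 scores by matched pair. Pair 10's alcoholic score is taken as 8,
# the value implied by the reported mean of 11.7 (an inference, not a quote).
alcoholic = [10, 12, 12, 10, 14, 13, 12, 14, 12, 8]
control = [5, 6, 2, 2, 2, 4, 2, 3, 4, 4]

# Consistency: in how many pairs does the alcoholic S score higher?
higher = sum(a > c for a, c in zip(alcoholic, control))
n = len(alcoholic)
# Chance of all n pairs going in one named direction (applies when higher == n).
p_all_same = Fraction(1, 2 ** n)

# Magnitude: mean levels, their ratio, and whether the distributions overlap.
ratio = mean(alcoholic) / mean(control)
overlap = min(alcoholic) <= max(control)

print(f"{higher} of {n} pairs higher, P = {p_all_same}")
print(f"means {mean(alcoholic):.1f} vs {mean(control):.1f}, "
      f"ratio {ratio:.1f}, overlap: {overlap}")
```

The count supplies the consistency index and the ratio of means the magnitude index; here both point the same way.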

Presented in Table 2 are some numbers from an experiment by Carter
and Schooler (1949) dealing with "Value, need and other factors in
perception". Without belaboring the questionable behavioral status of these
terms, the conclusion is drawn that "the rich and poor children's
judgments were essentially the same....". This conclusion is incorrect. There
are five events (coins) and in every instance the average judgment of the
poor children was larger than that of the rich. Five events in the same
direction occur only 1/32 times on a chance basis. Thus the consistency
looks potentially real although the magnitude is admittedly small. Both
sides of the analysis coin - magnitude and consistency - must be examined
if the data are to be squeezed dry. In this case, essentially "no
difference" was concluded where perfect consistency exists. Instances of this
point could be multiplied, but the matter should be clear.

Table 2

Average judgments of coin size in millimeters by rich and poor
children.

(Carter and Schooler, 1949)

         DIME    PENNY    NICKEL    QUARTER    HALF DOLLAR

SIZE     17.8    19.0     21.2      24.1       30.5
Rich     16.3    17.6     21.0      25.4       33.1
Poor     16.5    18.6     21.2      25.7       33.9
t          .5     2.2       .3        .5         .8

Another case in point involves some data based on Sheldon's somatotype
measures and anthropometric variables as they relate to the criterion
of success or failure in flight training during World War II. The
biserial correlations between his 12 measures and the criterion were as
follows:

-.10   .08   .05   .03   .11   .06   .03   .01   -.07   .08   .02   .11



The absolutely low level of these correlations is not surprising 
since these were highly selected individuals and the distributions were 
compressed and truncated. The eye-catcher is the pivoting of the numbers 
around zero. Four are negative and eight positive with a mean of about 
.027. It seems unlikely that prolonged statistical manipulation will yield 
much beyond the conclusion of a near chance finding. 

Table 3 contains some numbers based on quite complex procedures
(Pascal et al, 1966). They represent average ratings over a number of
behavioral variables from S's report of the behaviors exhibited by his
parents toward him in the early years of his life. In other words, they are a
large sample of behavior from a small N. All Ss had surgical intervention
for their ulcer symptoms so they are very homogeneous in this regard.
Despite this similarity, considerable difference emerged between those who
lost their ulcer symptoms after surgery and those who did not. Since
matched pairs were involved the binomial analysis can be applied. These
numbers were selected because they present complications. In the first
instance there is a tie for the average ratings for Pair 7 for the stimulus
category "Mother". By reference to the appropriate binomial table
the chances are 11/1024 of getting nine events in 10 in the same direction.


Table 3

Pascal-Jenkins Scale ratings for Mother and Father for 10 pairs of
ulcer patients matched on sex, age, vocation and education, one member
responding successfully to ulcer surgery, the other failing.

(Pascal et al, 1966)

               MOTHER                   FATHER
PAIR     Success   Failure       Success       Failure

  1        2.8       1.5           1.0           1.6
  2        2.8       1.9           2.7           1.8
  3        2.7       2.4           2.8           2.3
  4        5.0       1.9           3.0           1.6
  5        2.9       2.7           2.2           2.2
  6        2.6       2.6           2.0           2.2
  7        2.9       2.9           2.9           2.7
  8        2.8       2.1           2.3       [illegible]
  9        2.6       2.1           2.2           2.3
 10        3.0       1.8           3.0           2.0

Mean       2.9       2.2       [illegible]       2.0

The chances of 10/10 are 1/1024. The tie is split in half by averaging
these two probabilities, the outcome being 6/1024. One could, of
course, throw all ties against oneself, but this seems like too much deck
stacking.

The "Father" case for Pair 7 is even more complicated, there being
three reversals and one tie. The tie is treated as in the previous case.
The chances of obtaining 10/10 events in the same direction are 1/1024, 9
are 10/1024, 8 are 45/1024, 7 are 120/1024 and 6 are 210/1024. Remembering
that we always want the probability of an event as extreme or more
extreme, the total probability for 7 or more events is 176/1024 while the
odds for 6 or more events are 386/1024. Averaging these last two
figures (176 and 386) we obtain an overall figure of 281/1024. About
280 times in 1000, chance would produce a result like this.
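
The tie-splitting arithmetic for both cases can be checked in a few lines. What follows is only a sketch: it assumes the "appropriate binomial table" is the upper tail of the binomial with p = 1/2, and the function names upper_tail and split_tie are made up for illustration.

```python
from fractions import Fraction
from math import comb


def upper_tail(n: int, k: int) -> Fraction:
    """Chance of k or more same-direction events out of n when p = 1/2."""
    return Fraction(sum(comb(n, j) for j in range(k, n + 1)), 2 ** n)


def split_tie(n: int, favorable: int) -> Fraction:
    """Average the tail with the single tie counted for and against the trend."""
    tie_for = upper_tail(n, favorable + 1)
    tie_against = upper_tail(n, favorable)
    return (tie_for + tie_against) / 2


# "Mother": nine pairs in the same direction plus one tie.
mother = split_tie(10, 9)
# "Father": six pairs in the same direction, three reversals, one tie.
father = split_tie(10, 6)

print(f"Mother: {mother * 1024}/1024 = {float(mother):.4f}")
print(f"Father: {father * 1024}/1024 = {float(father):.4f}")
```

Exact fractions are used so that the results can be compared directly with the figures over 1024 quoted in the text.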

What all this verbiage and artful number management adds up to is 
what one can see with the naked eye: there is really only a slight
difference between Successes and Failures as regards "Father". (In defense
of the investigation, these are the "worst" set of data selected from a
number of experiments.) Again, the point is clear. Without any particular
statistical sophistication, one can scan a complex set of data and see 
what's happened to the point of drawing the appropriate and relevant con- 
clusion. Undoubtedly it takes practice. Reasonable advice calls for 
looking at the numbers of published papers, not the words. 

NOSE COUNTING, ASSOCIATION AND CORRELATION

The case of the beer tax facts. One of the items that triggered off
this return trip to quick and dirty statistics never-never land is
presented as Table 4. Quick inspection of it, particularly the first and
last numerical columns, suggests a substantial relationship of a negative
nature: the more beer consumed the less the tax, and conversely the
higher the tax, the less the beer drunk. (The pamphlet accompanying the
table argues the unfairness of the case, but we are not concerned here
with economics.) There are over 450 numbers in this table. That is too
many to analyze unless one is practicing arithmetic. The case is an
excellent one for applying and demonstrating short-cut procedures.

Suppose we're simply interested in determining whether this appar- 
ent negative relationship between taxes and beer consumption is "real", 
i.e., is large enough to provide a base for arguing a change in taxation. 
Further, suppose we're interested in the overall tax structure and not 
local matters, and finally suppose we're not good at arithmetic and thus 
want to work with as few numbers as possible to minimize the possibility 
of error. The solution is simple; take the extreme cases from the first 
and last column. If the relationship holds in this sub-sample of data, 
it should hold across the board. 

The only gimmick to watch for here is a curvilinear, say U-shaped, 
relationship where we happen to select data that fits a straight line 
portion of the relationship. (Graphical representation obviously helps 
in this regard.) Inspection clearly indicated no changes in direction in 
trend in the numbers presented. One should always remember that correla- 
tion is nothing more than a number reflecting to what extent high numbers 
in one set go with high (or low) numbers in the other set. The way to 
find out is to look and see.

Table 5 presents consumption figures for the nine states with the
highest and the nine with the lowest tax rates. (Nine is obviously
arbitrary; it's small enough to simplify arithmetic and the middle score
of an odd number of measures constitutes an average.) A brief examination
of the table reveals non-overlapping distributions: the highest number in
the first column is less than the lowest number in the second column. For
those interested in a slightly more sophisticated treatment, a
permutation-combination analysis of 10 events beating 10 others yields a
probability around five in a million, quite a rare occurrence on any basis.
(Behaviorally speaking, the binomial expansion bears on matched or paired
events so it is not applicable to these two sets of 10 independent events.)
In any case, these data reinforce the point that a great deal can be read
into results by careful inspection.
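
The "five in a million" figure can be checked by counting arrangements. The sketch below assumes that the permutation-combination analysis asks how often one prespecified set falls entirely below the other, out of all equally likely ways of splitting the pooled values into two sets. The function name is illustrative only.

```python
from math import comb


def separation_probability(n_first: int, n_second: int) -> float:
    """Chance that the first group falls wholly below the second.

    Of the comb(n_first + n_second, n_first) equally likely ways to choose
    which pooled values belong to the first group, exactly one puts all of
    them at the bottom.
    """
    return 1 / comb(n_first + n_second, n_first)


p_ten = separation_probability(10, 10)  # the "10 beating 10" case in the text
p_nine = separation_probability(9, 9)   # two sets of nine, as in Table 5

print(f"10 vs 10: {p_ten * 1e6:.1f} per million")
print(f" 9 vs  9: {p_nine * 1e6:.1f} per million")
```

Under this assumption the 10-versus-10 count gives roughly five in a million, as stated. For two sets of nine the same count is less extreme, though still rare.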

Numbers are sometimes useful in summarizing findings. In this instance
it would be handy to have a single number to represent the intensity
of relationship, association or correlation between the two dimensions
of variation, tax rate and beer consumption. The easiest way to
obtain such a number is to sort the data into a two-by-two table. Table
6 represents this transformation for the data of Table 5. The grand mean
(mean of means) was taken for the two columns of Table 5 and the individual
cases sorted as above or below this value while retaining the original
classification of high or low tax. A Phi coefficient has been computed
for the resulting two-by-two sort in Table 6. Phi is easy to compute and
represents the more elaborate correlation coefficients quite accurately.
It consists of a fraction, the numerator of which is the difference between
the products of the diagonal numbers. The denominator is the square root
of the product of the four marginal totals. In this instance a figure
of -.80 emerges, indicating a substantial negative relationship between
tax rate and beer consumption and, more importantly, clearly supporting
the inspectional conclusion.

Table 5

Per capita consumption in gallons by the nine states with the highest
and lowest total tax per barrel.

                     CONSUMPTION
              Highest Tax    Lowest Tax
                  7.1           15.2
                  7.1           19.0
                  5.9           15.6
                  6.3           18.9
                  9.2           26.6
                  6.8           19.6
                 14.3           17.1
                 14.6           14.8
                  8.9

Median            7.1           17.1
Mean              8.9           18.0
Grand Mean               13.5

Table 6

COMPUTATION OF A CORRELATION COEFFICIENT (phi)
FOR THE BEER TAX DATA OF TABLE 5

              GREATER THAN           LESS THAN
           GRAND MEAN OF 13.5    GRAND MEAN OF 13.5    Total
High Tax           2                     7                9
Low Tax            9                     0                9
Total             11                     7               18

$\phi = \frac{(2)(0) - (7)(9)}{\sqrt{(9)(9)(11)(7)}} = \frac{-63}{79} = -.80$
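
The Phi computation just described can be sketched as follows. The cell counts are an assumption derived from Table 5 and the grand mean of 13.5: two of the nine high-tax states (14.3 and 14.6) fall above it, and all nine low-tax states do, since the lowest of them is 14.8. The function name is illustrative.

```python
from math import sqrt


def phi_coefficient(a: int, b: int, c: int, d: int) -> float:
    """Phi for the two-by-two table [[a, b], [c, d]]."""
    numerator = a * d - b * c  # difference between the diagonal products
    denominator = sqrt((a + b) * (c + d) * (a + c) * (b + d))  # marginals
    return numerator / denominator


# Rows: high tax, low tax. Columns: above, below the grand mean of 13.5.
phi = phi_coefficient(2, 7, 9, 0)
print(round(phi, 2))
```

The sign depends only on how the rows and columns are ordered. Here a negative value means that high tax goes with low consumption.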

The usual word of caution is called for regarding the interpretation
of indices of correlation and association. The easy part is computation;
the hard part is saying what the product means. Things go together
or covary. One does not "cause" the other. There may be a substantial
correlation between the abortion rate in Brooklyn and the rainfall
in Rangoon, but it would be difficult to uncover a cause-effect
relationship. In other words, caveat emptor when it comes to the
interpretation of correlations and other measures of association. For
example, Sargent (1955) computed the correlation between the number of
letters in the names of the months and the mean monthly precipitation for
1947. The figure was -.61 with an associated probability of less than .05.
The reader is left to figure out what the covariation means.

Thus far we have dealt with instances of nice, clearcut positive
findings. Inspectional analysis applies equally effectively to negative
results or cases of "essentially no difference". A case in point comes to
hand in the way of a study of activity patterns of schizophrenic patients
(Chapple et al., 1963). Among many other things, the investigators were
attempting to be behaviorally economical in seeing if four observations per
day would suffice instead of six. The differences are presented in Table 7,
S by S, for four separate days.



Table 7

Differences in activity between six and four observations per
day for 10 schizophrenic patients.

(Chapple, 1963)

 S       DIFFERENCES            S       DIFFERENCES
 1     6    5   -1   -2         6    -9    4    2   11
 2     0    5    0    0         7     0    7   -2   -6
 3    -8    1    9   -2         8     1   -2   18    4
 4     6   -2   -4   -1         9     0    0   -5   -2
 5    -4    6  -13   -7        10     0   -1    0    0

There are many ways of cutting and slicing these data. The investigators
did it the hard way by doing 10 individual t-tests, one for
each S. Simple counting of pluses, minuses and zeroes reveals a 14, 17
and 9 split for the forty numbers, a finding quite in accord with chance
expectation. Inspection of the data suggests no large systematic
differences; counting supports this view. If one wishes to be a little more
thorough about the analysis, a total can be taken, S by S. Five Ss show
negative sums, four positive and one zero. The mean difference is 0.9.
Again chance prevails by this token.

At this juncture it seems wise to comment that there are in the 
behavioral world some sets of data that are too complex to be handled by 
inspectional analysis. Factor analytic studies are a case in point. Data 
in behavioral science seem to be more complex the less we know. As know- 
ledge increases, simplicity sets in and the stage is set for once-over- 
lightly kinds of analysis such as inspection. In any event, there is a 
serious question concerning the utility of factor analysis and similar 
cumbersome procedures. They may be a defense, an escape through the
machine for the investigator, but they help the audience little. I believe
that at least one expert in the field said that no worthwhile test has
ever been developed as a result of a factor analytic study. This sounds
reasonable.

Returning to the main stream of this section, some data are shown
in Table 8 having to do with Experimenter differences. Four different
E's each tested four pre-school children in a discrimination learning
set-up. If the child had not learned in 36 trials, testing was ended. It is
clear that E-3 had one S fail to learn while E-4 had two. Are the
four differences large enough to warrant taking action? While there is
clearly a suggestion that E-4 has more difficulty in conditioning children
in this situation, no hard conclusion can be drawn. Furthermore, it
hardly seems necessary to apply Chi Square or other indices of frequency
difference. Inspection makes E-4 stand out. The real question is whether
he continues to stand out with repeated testing. If two more sets of
four children each were subjected to the discrimination operations by each
E and the same trend emerged, it would be most plausible to consider E-4
as being drawn from a different universe than the other examiners and to
study him as the variable in the differential findings.

Table 8

Trials to reach a criterion in discrimination learning for four
sets of four pre-school children each tested by a different
Experimenter. Learning was terminated after 36 trials.

             EXPERIMENTER
 S       1       2       3
 1       8      20      36
 2       6      17       5
 3      12       5      32
 4      23      12      15

Unfortunately, the behavioral literature is replete with positive 
examples amenable to inspectional analysis, but the bias about publishing 
negative findings on the part of both the author and the editor cuts way
back the instances of negative findings or small, inconsistent,
insignificant results. Negative findings are on many occasions more important
than positive ones - they allow us, for instance, to disregard variables. 
A Journal of Negative Findings is still needed. 

There are a huge number of tests of association, contingency and
correlation - far too many to even mention in this context. If one wishes
to use one, the Fisher-Yates Exact Test is recommended for the two-by-two 
set-up. It corresponds to the Phi Coefficient although it yields only 
direct probabilities with no direct indication of extent of relationship. 
It is cumbersome to compute and Chi Square is a fair approximation to it 
and much easier to calculate. Across the board, the Phi Coefficient does 
the job. 




HOW BIG? MAGNITUDE CONSIDERATIONS 

While statistical procedures are continuous and to a large extent 
independent of experimental design (although determined by it) - the t- 
test flows into the F-test, one dimension of experimental variation shades 
over into two and more - it is practically convenient to separate out the 
operations for two groups from those for more than two groups with one 
experimental treatment and, in turn, the latter from situations involving
two or more dimensions of experimental variation. Such a course will
be followed here. In addition, while related, the statistical procedures
for two independent groups differ from those for two related groups and 
will be further separated. 

An outline follows of the subsequent sections of this paper dealing
with the statistical assessment of magnitude for the several types of
experimental design in increasing order of complexity.

I. The two group case: independent groups, matched pairs or self- 
control design and matched groups. 

II. Anova : one dimension of experimental variation involving three 
or more groups or conditions. 

III. Anova: two "simultaneous" dimensions of experimental variation:
matching or self-control, repeated measurement and "simple" factorial
design.

IV. Complex Anova: more than two "simultaneous" dimensions of ex- 
perimental variation. 

The emphasis in discussing these procedures, consistent with the
rest of the paper, will be on easy, error-minimizing, efficient, short-cut
ways of treating data and clarifying trends clearly visible in behavioral
measurements.

MAGNITUDE: THE CASE OF TWO GROUPS

1. Independent Groups. The overlap in the various statistical
approaches is indicated by the fact that several sets of data appropriate
to this section have been presented in other contexts. For purposes of
exposition, Table 9 is given in which some modified findings are summarized.

There are a good half dozen ways to tackle these numbers statistically,
but, as always, inspectional analysis is numero uno. Significance of
some kind is clearly shown by the fact of overlap of the two sets of five
numbers by only one case. The t-test would appear to be the most
appropriate analytical technique, but it is the most insensitive, yielding a
P of only .055 while the Arrangement Technique (diluting the difference
by putting the tied case for high SES first) produces a value of .008.
Sorting the data above and below the grand mean (31.5) yields a
Fisher-Yates P of .024 with a corresponding Phi Coefficient of about .82.
(Considering time, it took about one minute each for the Fisher-Yates and
Phi and nearly five minutes for the t-test.) By any token the experimental
treatment of SES has had a large impact on ability to reverse in
discrimination formation.
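
The two-by-two sort, Phi and the Fisher-Yates probability for Table 9 can be reproduced along the following lines. This is a sketch under two assumptions: the fourth low-SES score is read as 34 (the value consistent with the grand mean of 31.5), and the Fisher-Yates P is taken as one-tailed. The helper table_probability is a hypothetical name.

```python
from math import comb, sqrt

# Table 9 as read here (assumption: the fourth low-SES score is 34).
low_ses = [40, 38, 36, 34, 32]
high_ses = [32, 30, 28, 25, 20]

grand_mean = (sum(low_ses) / len(low_ses) + sum(high_ses) / len(high_ses)) / 2

# Two-by-two sort around the grand mean.
a = sum(x > grand_mean for x in low_ses)    # low SES, above
b = len(low_ses) - a                        # low SES, below
c = sum(x > grand_mean for x in high_ses)   # high SES, above
d = len(high_ses) - c                       # high SES, below

phi = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


def table_probability(a: int, b: int, c: int, d: int) -> float:
    """Hypergeometric probability of one two-by-two table with fixed margins."""
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)


# One-tailed exact P: the observed table plus any more extreme ones.
# Here a is already at its maximum, so only the observed table counts.
p_exact = table_probability(a, b, c, d)

print(grand_mean, (a, b, c, d), round(phi, 3), round(p_exact, 3))
```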

In the previous Q & D paper, considerable space was devoted to the
Range Test. It is one of many variants of the t- and F-tests based on
substituting the range for the standard deviation. The usual caution
applies: beware of extreme outliers; a single deviant case can produce
insignificance where significance really exists. The Range Test is simple,
efficient and easily understood. It consists of taking the range across
means (for two or more values), multiplying it by the (average) number of
cases in the samples and dividing the resulting figure by the average
range in the sample data. In Table 9, the range across means is 9, N is
5 and the mean range in the samples is 10. The ensuing Range Test value
is 4.5. For two groups and Ns of 10 or less, the resulting value can be
referred to the t-table. For more than two conditions and Ns larger than
10, degrees of freedom are computed by multiplying the average number of
cases minus two by the number of conditions and referring the resulting
Range Test figure to a special table contained in the previous Q & D
manuscript.

Table 9

Trials to a criterion in discrimination reversal as a function of
socio-economic status in pre-school children (hypothetical, doctored
data based on preliminary findings).

 S         LOW SOCIO-          HIGH SOCIO-
         ECONOMIC STATUS     ECONOMIC STATUS
 1             40                  32
 2             38                  30
 3             36                  28
 4             34                  25
 5             32                  20

Mean           36                  27

Phi = .816
P for Arrangement Technique = .008
P for t-test = .055
P for Fisher-Yates Exact Test = .024

In any event, it is obvious that the Range Test is far more significant
and far less time-consuming than the conventional t-test. The P-value
for the data of Table 9 by this technique is about .001, contrasted
to the .055 according to the classical t-test.
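
The Range Test arithmetic is simple enough to sketch directly. The function below follows the verbal recipe given above (range across means, times average N, divided by average within-sample range). The name range_test is illustrative, and the data are Table 9 with the fourth low-SES score read as 34, which is an assumption.

```python
def range_test(groups: list[list[float]]) -> float:
    """Range across means times average N, divided by the average sample range."""
    means = [sum(g) / len(g) for g in groups]
    range_of_means = max(means) - min(means)
    average_n = sum(len(g) for g in groups) / len(groups)
    average_range = sum(max(g) - min(g) for g in groups) / len(groups)
    return range_of_means * average_n / average_range


low_ses = [40, 38, 36, 34, 32]
high_ses = [32, 30, 28, 25, 20]

value = range_test([low_ses, high_ses])
print(value)
```

The resulting figure still has to be referred to a table for its probability. That step is not attempted here.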

The range is a highly useful estimate of variability so long as 
grossly deviant cases are not involved. For example, the range divided 
by N is a close estimate of the standard error of the mean when outliers 
are not involved and short cuts a good deal of computational labor in
deriving the t-test value.

Across the board, the data must be carefully examined and the one
or two most efficient procedures applied, i.e., those that maximize re- 
turn from the behavioral data and minimize labor and error. 

2. Matched Pair or Self-Control Design. This variant of the two
group case is far and away the most efficient if the matching measures
correlate with the experimental behavior so that variability is cut
down. A case in point is shown in Table 10 where doctored data make the
point. Examination of the data shows the E-group exceeding the C-group
by a small margin. Treated as independent groups, the P-value
emerging from the application of the t-test is .18. It is, however,
obvious to inspection that a substantial relationship holds between the two
sets of numbers. As a matter of fact, the rank correlation is .88.
Another obvious point is that six of the eight differences are positive.
The Binomial Expansion, previously treated in detail, is clearly applicable
to these data, but yields a P-value of only .144. It is to be noted that
the two reversals are the smallest in absolute magnitude. This situation
calls for a test sensitive to these magnitudes. The Wilcoxon Rank T-Test
is appropriate. It involves ranking the differences by magnitude without
regard to sign and sorting out sums of ranks by signs. The smaller
sum of ranks is then referred to the table presented in the previous Q &
D paper and in some standard statistics texts. The resulting P is ca. .02.
This is probably as dry as the data can be squeezed, but to complete the
picture, classical t was applied and produced a P of .018.
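
Both the straight count and the rank T can be sketched for the Table 10 data. The scores below are the E and C columns as read from that table. The helper signed_rank_sums is a hypothetical name, and giving tied magnitudes the average of the ranks they span is an assumption about how ties are handled.

```python
from math import comb

# E and C scores as read from Table 10.
e_scores = [20, 34, 24, 37, 23, 35, 30, 29]
c_scores = [16, 35, 22, 29, 24, 30, 27, 25]
diffs = [e - c for e, c in zip(e_scores, c_scores)]

# Consistency: binomial tail for the count of positive differences.
n = len(diffs)
positives = sum(d > 0 for d in diffs)
p_binomial = sum(comb(n, k) for k in range(positives, n + 1)) / 2 ** n


def signed_rank_sums(values: list[int]) -> tuple[float, float]:
    """Return (sum of ranks for positives, sum of ranks for negatives).

    Ranks are assigned to absolute values, with ties given the average
    of the ranks they span; zero differences are dropped.
    """
    nonzero = sorted((v for v in values if v != 0), key=abs)
    ranks = [0.0] * len(nonzero)
    i = 0
    while i < len(nonzero):
        j = i
        while j < len(nonzero) and abs(nonzero[j]) == abs(nonzero[i]):
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 through j
        for k in range(i, j):
            ranks[k] = average_rank
        i = j
    positive = sum(r for v, r in zip(nonzero, ranks) if v > 0)
    negative = sum(r for v, r in zip(nonzero, ranks) if v < 0)
    return positive, negative


plus_sum, minus_sum = signed_rank_sums(diffs)
t_statistic = min(plus_sum, minus_sum)  # the smaller sum goes to the table

print(diffs, round(p_binomial, 3), t_statistic)
```

The smaller rank sum is what gets referred to the Wilcoxon table. The lookup itself is left to the reader.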

There are several points here. The first is to match on variables 
that have something to do with behavior in the experimental situation so 
that a correlation in performance is generated. If little relationship 
is produced, time has been wasted in the matching procedure. 

The self-control design is, of course, the limiting and best case
of matching since each S is more like himself on different occasions than
he is like anyone else. Another point is that if you've matched and a
correlation has come about to cut down on variability, by all means take
advantage of it by applying the statistical procedures appropriate to the
set-up. It is clear in Table 10 that a correlation has emerged as
reflected in the greatly decreased variability in the distribution of
difference scores as contrasted to the spread in the original measures. Thus
a matched-pair treatment is called for, the binomial for consistency and
rank T and/or classical t for magnitude. Whenever the reversals are small
in size, the latter techniques - that take magnitude into account - are
preferable to the straight count procedure.

Table 10

Hypothetical data: The efficiency of matching or self-controlling
versus independent groups.

 S        E       C     DIFF
 1       20      16       4
 2       34      35      -1
 3       24      22       2
 4       37      29       8
 5       23      24      -1
 6       35      30       5
 7       30      27       3
 8       29      25       4

Mean     29      26       3

Another situation where magnitude treatment is needed involves very
small Ns. For instance, in an experiment on the combined application of
reward and punishment in conditioning on extinction responding, two pairs
of pigeons, operating in standard Skinner boxes, were matched on APR
responding prior to the use of electric shock. One member of each pair was
shocked until responding stabilized at circa 3% of its original value.
Then extinction operations were applied. Total extinction responses in
11 hours were:

Pair      Shock     Non-Shock
 I         640       13,690
 II        370       13,100

One hardly need analyze these data; they serve as a tour de force.
The classical t-test is the only analytical procedure applicable and it
yields a P-value of .004 for the one degree of freedom involved - if one
is a stickler for statistical protocol. Actually, no analysis is necessary
and each pair of birds should be considered a separate experiment.
The point is made.
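
For the curious, the t-test on two pairs can be worked out without a table. With one degree of freedom the t distribution is the standard Cauchy distribution, so the tail probability has a closed form. The sketch assumes a one-tailed test and reads the second Non-Shock total as 13,100. Both are assumptions rather than facts stated in the text.

```python
from math import atan, pi, sqrt

shock = [640, 370]
non_shock = [13690, 13100]  # second value read as 13,100 (assumption)

diffs = [n - s for s, n in zip(shock, non_shock)]
n = len(diffs)
mean_diff = sum(diffs) / n
variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
t_value = mean_diff / sqrt(variance / n)

# With one degree of freedom, t follows the standard Cauchy distribution,
# so the upper-tail probability is 1/2 - arctan(t) / pi.
p_one_tailed = 0.5 - atan(t_value) / pi

print(round(t_value, 1), round(p_one_tailed, 4))
```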

Another, somewhat more dramatic example of the same point is contained
in an experiment in crowding the threshold in ECT. Patients exposed
to Electro-Convulsive Treatment exhibit some resistance to the procedure,
a small part of which shows up in delay in insertion of the tongue
depressor that is used to prevent tongue swallowing during convulsions.
A student of mine was interested in crowding the threshold on this delay.
He first took "before" measurements, a kind of latency of depressor
insertion. This interval in sec. for the experimental Ss to be trained was
12, 30 and 11. They were paired with controls with intervals of 5, 14
and 10. (The cards were deliberately stacked against the treatment by
having shorter latencies for Control Ss.) The experimental treatment
consisted of putting dissimilar objects in the mouths of Experimental Ss
and gradually, keeping the latency short, increasing similarity to the
tongue depressor. Then tests were conducted in the ECT setting. The
"after" scores in sec. for the Experimental Ss were 1, 9 and 2; for the
Controls 13, 30 and 19.

Before turning to the actual treatment of the numbers, let's look
at the overall design picture. This experiment can be looked on in a
quite complex way - over and beyond the complicated context in which it
is set. One could argue, admittedly somewhat irrationally, for an
analysis of covariance in which the pre-treatment measurements were
partialled out of the post-treatment ones by considering the covariation
between pre- and post-treatment scores. Setting aside the question of
whether correlation based on three points means anything, the question
remains, is complex analysis worth the trouble and will it yield anything
beyond what is produced by simple analysis? The answer, of course, is no.
As a matter of fact no consideration was given to standard analysis of
covariance, but rather the straightforward procedure was followed of
converting the latency scores into percentage change scores or savings
scores from "before" to "after". These turned out to be:

PAIR      EXP.      CONTR.
 1        92%       -160%
 2        70%       -114%
 3        82%        -90%
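
The savings scores follow directly from the before and after latencies quoted earlier. A minimal sketch, assuming the saving is the drop in latency as a percentage of the "before" value and reading the second control "before" latency as 14 sec. The function name is illustrative.

```python
def percent_saving(before: float, after: float) -> float:
    """Reduction in latency as a percentage of the 'before' latency."""
    return 100 * (before - after) / before


exp_before, exp_after = [12, 30, 11], [1, 9, 2]
ctl_before, ctl_after = [5, 14, 10], [13, 30, 19]

exp_savings = [round(percent_saving(b, a)) for b, a in zip(exp_before, exp_after)]
ctl_savings = [round(percent_saving(b, a)) for b, a in zip(ctl_before, ctl_after)]

for pair, (e, c) in enumerate(zip(exp_savings, ctl_savings), start=1):
    print(f"Pair {pair}: exp {e}%  contr {c}%")
```

A negative saving means the latency grew from "before" to "after", as it did for every Control S.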

The pairing now becomes almost irrelevant because of the size of
the effect. We have two sets of three events failing by a large margin
to overlap, and P is .05 by the Arrangement Technique. For didactic
purposes the t-test was applied to the distribution
of three differences across pairs and yielded a P-value of .008. In this
case, the training had such a large impact that the correlational feature
built in by matching was washed out. As a matter of fact the P based on
an independent sample t-test is slightly smaller than that accruing to
the matched t. In passing it might be noted that the Range Test is not
appropriate to these data because of the great disparity in the sample
ranges, i.e., 22% versus 70%, but the outcome is consistent with the
findings from the other procedures.

Again, the reader is advised that the purpose of statistics is to
"prove something" - the something his naked eye tells him has occurred in
the behavior of his organisms. The "telling" is, of course, a matter of
discrimination and, like all discrimination formation, takes time and
practice. Remember not to bother to analyze if "nothing" has happened,
if little or no behavioral differential between the groups is clearly
apparent.

The data of several of the tables presented earlier in this report 
are amenable to examination by the techniques spelled out in this section. 
It might be worthwhile to look at those numbers in this light. 

3. Matched Groups. On occasion it is possible to reap experimental
and statistical benefits from group matching where individual pairing is
not possible. In group matching, equivalence is achieved in the mean and
standard deviation of some a priori measure known or thought to correlate
with behavior in the experimental situation. It is a less precise and
sensitive measure than pairing which in turn is less exact than use of the
self-control procedure. If, however, behavior on the group matching
variable relates to the experimental measurement, there is a cut back in
variability and a corresponding gain in statistical sensitivity and
precision, i.e., the P-value is decreased. Group matching is employed for
several reasons. Among them are large Ns where pairing is overly
time-consuming; time limitations where Ss, say, go directly from
conditioning into extinction and time does not permit matching; and loss
for some reason of one member of an already matched pair.

The statistical procedures for analyzing data by the matched group
technique are spelled out in most statistical textbooks. Here, suffice
it to say that the overall correlation for both E and C groups combined
is computed between the "before" and "after" measurements. There is one
real potential gimmick in computing such a correlation. By the nature of
the experimental treatment, it sometimes happens that the relationship
between the matching measure and the criterion is thrown off by the
treatment so that differential correlation across the E and C groups
emerges. I have seen data where r is .80 in the C-group and near zero in
the E-group. In such cases pooling the numbers for correlational purposes
appears questionable. One could argue for computing the correlation
separately and combining correlations by z-transformations, but this seems
to be a rather sticky refinement. The investigator must decide whether to
forgo his matching in cases such as this or simply report the differential
correlation and go ahead and combine anyway in order to gain whatever
precision and increase in sensitivity accrues to the matching. In any
event, he is obligated to examine closely the relationship between the two
variables separately for the E and C groups. (This matter will be
considered again in connection with the analysis of covariance in a later
section.)

An example of the use of the group matching procedure is contained
in an experiment dealing with the hors d'oeuvre effect of pre-feeding
pigeons operating in a Skinner box. Initially, the design called for 12
pairs of pigeons matched on responding in conditioning to be divided into
two experiments of six pairs each. One pigeon was ailing and did not
complete conditioning and had to be dropped from the experiment.
Fortunately, this bird was near the middle of the distribution, so rather
than discard his partner, group matching was used.

The experimental treatment consisted of pre-feeding 11 of the 23
birds an amount of food that increased their body weight approximately
1.5% prior to an extinction test. The experiment aimed at one test of
the drive-reduction reinforcement position, which assumes that over a wide
range, increased drive leads to increased response strength. The contrary,
contiguity position adopted in this experiment was that reinstating cues
(food) associated with responding in conditioning would increase response
strength. Thus, by this token, increasing body weight (decreasing drive)
by pre-feeding prior to the extinction test would provide more of the
stimulus compound associated with responding during previous conditioning
and thereby generate more responses in the extinction test. In a crude
sense we were trying to "prove" the Null Hypothesis associated with the
drive-reduction position, i.e., show no difference. A lack of difference
would, of course, favor the contiguity cue-reinstatement view. A difference
favoring the lower-drive, prefed group would be gravy. The latter was the
outcome as shown in Table 11 where the distribution statistics are
presented.

The matching correlation between responding in conditioning and the 
10 min. extinction test was .65 with no differential effects appearing 
across E and C conditions. Such intensity of relationship appreciably
reduced the standard error of the difference so that a one-tailed P of
.10 appeared, favoring the prefed group and the cue reinstatement hypothesis.

That this effect is "real" is demonstrated by the fact that a number
of other experiments yielded comparable results with some even more
striking. In one, for example, where pairing was achieved, eight prefed
birds exceeded their control partners. In these experiments such
variables were introduced as amount prefed, time lag between pre-feeding
and test and schedule of reinforcement.

Table 11

The hors d'oeuvre effect: The influence of pre-feeding on 10 min.
of non-reinforced Skinner box responding in pigeons.

                              No
         Hors d'oeuvre   Hors d'oeuvre
N             11              12
Mean         96.0            68.0
SD           69.8            59.2
t                    1.3
P                    .10

Group matching is not a particularly common practice. Where one
can group match, he can usually pair - a far more efficient technique.
Furthermore, if behavior on the matching dimension does not correlate
substantially with the experimental behavior the procedure is a waste of
time. Also, differential relations between E and C must be considered.
Sometimes matching is too much trouble, particularly where N is huge. On a
few occasions, as the one cited, it's worthwhile.

MAGNITUDE: THE CASE OF THREE OR MORE GROUPS WITH ONE DIMENSION
OF EXPERIMENTAL VARIATION: SINGLE CLASSIFICATION ANOVA

The continuity between this situation and the case of two indepen- 
dent groups, between the t- and F-tests, has already been indicated. Stan- 
dard textbooks spell it out; it need not be stressed here. In many in- 
stances we are experimentally interested in a functional relationship be- 
tween degrees of treatment and behavior. Thus we employ three or more 
points of our experimental variation and corresponding groups. This pre- 
sents a situation appropriate to one-dimensional or single classification 
analysis of variance. The complexity of anova lies in the increased N 
and nothing else. Basically, it's nothing but an elaborated t-test in- 
volving a comparison of treatment differences across conditions with an 
overall estimate of S-to-S variability ("individual differences"), that 
is, a ratio of variation in means to variations across individuals. As 
will be indicated, there are easier ways than the traditional for accomp- 
lishing this. 



Before launching into a treatment of single-classification anova, 
a basic word of caution is needed. When significance is achieved the 
procedure does not indicate what aspects of the behavior or what con- 
ditions generated the significance. In other words, the outcome of the 
application of anova to data is an open-ended proposition. For a given 
level of significance of F, the functional relationship can be linear, 
exponential or parabolic. Anova doesn't "care". Additional tests of 
significance (as well as careful scrutiny as always) are called for to 
tease out the exact features of the data producing the significance. For- 
tunately, tests are available for detecting outlying means that help to 
pin down the significance, but it should be indicated that they are cum-
bersome from an arithmetic standpoint. More will be said on this point
later.

1. Rank anova. The best way to illustrate anova is by an example.
Some actual data are presented in Table 12 that concern the gross bodily
activity of rats in an open field at three different drive levels deter-
mined by percent of satiated body weight. Gross movements were defined
in terms of eight-inch squares traversed and rearing responses. The over-
all project dealt with the impact of novel, unfamiliar stimuli of varying
intensities and characteristics on performance of gross and fine movements.

The first item to be spotted (after noting the clear trend for gross
movement to increase with drive) is the outlying case in the 90% group
which tops all others in responses. The ensuing heterogeneity of variance
poses real problems for classical anova and also for the Range Test con-
sidered in the previous section. The classical F test can be applied to



Table 12

Number of gross movements (locomotion and rearing) emitted in
5 min. by three groups of rats at different drive levels.

                   DRIVE LEVEL

   S          80%       90%      100%

   1          154       114       108
   2          172       217       127
   3          204        87        97
   4          139       128       127
   5          181       145       103
   6          165        --       178
   7          138        --        --

 Mean        164.6     138.2     123.3
 Median      165.0     128.0     117.5
 Range        66       130        81

the variances as the ratio of the larger to the smaller variance, but
more appropriately the Hartley F-maximum test should be used. (It is
treated in most standard statistics textbooks.) While its value only
reaches the 10% level, problems remain for anova procedures that deal
with the raw heterogeneous numbers.

A simple way out is to transform the raw scores to ranks and apply
the Kruskal-Wallis Rank Anova as spelled out in most current standard
statistics texts. Note that ranking the data tends to minimize hetero-
geneity of the numbers; it does not change their relative standing. This
technique has the disadvantage, along with the traditional anova, of allow-
ing opportunity for considerable arithmetical error, but it is still the
most appropriate procedure for the numbers at hand. The essence of this
procedure is to pool all the numbers and rank them from, say, high to low,
sum ranks by columns and substitute the sums of ranks into a formula which
produces a number, treated as a Chi Square, that reveals whether the col-
umn sums have pulled sufficiently apart to warrant rejection of the Null
hypothesis of a common target or parent population. In this instance the
overall P-value from the rank anova is .027.
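
The ranking steps just described can be sketched in a few lines. This is
a minimal illustration and not the author's worksheet: the function names
are hypothetical, the scores are those read from Table 12, ranks run from
low to high (the direction does not affect the result), no correction for
ties is applied, and the probability rests on the Chi Square approximation
with two degrees of freedom, so the figure obtained need not agree with
one taken from small-sample tables or with the value quoted above.

```python
from math import exp

# Scores as read from Table 12 (assumed correct; keyed by drive level).
GROUPS = {
    "80%": [154, 172, 204, 139, 181, 165, 138],
    "90%": [114, 217, 87, 128, 145],
    "100%": [108, 127, 97, 127, 103, 178],
}


def average_ranks(values):
    """Rank from low to high; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            # Sorted positions i..j (0-based) hold ranks i+1..j+1.
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_anova(groups):
    """Return (rank sums by column, Kruskal-Wallis H, no tie correction)."""
    pooled = [x for g in groups for x in g]
    ranks = average_ranks(pooled)
    n = len(pooled)
    sums, start = [], 0
    for g in groups:
        sums.append(sum(ranks[start:start + len(g)]))
        start += len(g)
    h = (12.0 / (n * (n + 1))
         * sum(s * s / len(g) for s, g in zip(sums, groups))
         - 3 * (n + 1))
    return sums, h


rank_sums, h = rank_anova(list(GROUPS.values()))
# With three columns H is referred to Chi Square with 2 df, whose upper
# tail area is exp(-H / 2).
p_approx = exp(-h / 2)
print(rank_sums, round(h, 2), round(p_approx, 3))
```

The rank sums are the only quantities carried into the formula, which is
why a single outlying score has far less pull here than in the raw numbers.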

None of the anova procedures pinpoints what features of the data are
generating the significance. In the current instance, inspection suggests
the 80% group to be deviant with the behavior of the other two groups tail-
ing off in a curvilinear, asymptotic fashion. The data are probably too
crude to warrant more refined statistical treatment. The point is clear-
ly made that higher drive tends to be associated with greater gross bodily
movement.



Table 13

Rats' Skinner box extinction responses with 24 hr. food depri-
vation in extinction and the given hours of deprivation at con-
ditioning.  (Finan, 1940)

                HOURS OF DEPRIVATION AT CONDITIONING

                   1        12        24        48

 N                28        29        30        30
 Mean             31.6      62.0      53.8      45.6
 Median           25.0      57.5      40.0      41.0
 SD               25.4      35.0      41.8      18.2
 Estimated
 Range           100       170       190        60

2. Classical Anova. To illustrate the use of classical anova and
the Range Test, some data obtained by Finan (1940) are presented in Table
13. Before turning to the analysis, let's consider this experiment from
a design and behavioral standpoint. In essence, what Finan did was con-
dition four groups of rats in a Skinner Box at 1, 12, 24 and 48 hours of
food deprivation. All were then extinguished at 24 hours of deprivation.
This set-up becomes an incomplete block design where the complete design
would have all four deprivation values represented in extinction as well
as in conditioning. The absence of complete information thus limits the
inferences that can be drawn.

Given the set-up as it is, certain a priori considerations apply.
The fact of the matter is that drive was changed from conditioning to
extinction for three of the four groups and not changed for the fourth.
The principle of generalization and its corollary of generalization decre-
ment clearly apply: The greater the change in the stimulus conditions,
the greater the behavioral decrement. On the face of it the groups with
the greatest change in drive should show the greatest response decrement -
and they do. The 1 and 48 hour groups are below the level of the 12 and
24. The situation is complicated by some special drive manipulations Finan
employed and even more by the fact, shown in the data of Table 12, that
higher drive leads to increased bodily activity which, in this instance,
could readily be channeled into the bar pressing response. The effect is
there; the responses of the 48 hour group exceed those of the 1 hour group
with both roughly equidistant from the 24 hour group in deprivation. All
in all, the generalization position fits the data nicely except for the
peak performance of the 12 hour group and this may well be sampling or
attributable to the special operations.

The generalization principle provides a logical and legitimate basis
for combining the 1- and 48-hour groups against the pooled 12 and 24 hour
groups. Ardent and avid statisticians may throw up their hands and call
this a sticky procedure, but behavior theory dictates it. The P-value
for the t-test applied to these combined data is .008 suggesting the oper-
ation of a systematic variable, namely, generalization and generalization
decrement from conditioning to extinction on the drive dimension. In other
words, the less the drive change, the higher the level of extinction per-
formance.

Anova is basically a simple though cumbersome procedure. In essence
the deviations or differences across means are compared with chance varia-
tion as reflected in differences among individuals. The calculating pro-
cedure follows directly: deviations of means around the grand mean are
contrasted with the total of individual deviations around means of col-
umns or conditions. The exact calculating steps in deviation or raw score
units are treated in all books considering anova and need not be detailed
here.
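
As a rough sketch of that comparison, the F ratio can be assembled from
group sizes, means and standard deviations alone. The function name is
hypothetical, the figures are those read from Table 13, and the standard
deviations are assumed to have been computed with N - 1 in the denomi-
nator, which the source does not state. The resulting F would still have
to be referred to tables with k - 1 and N - k degrees of freedom.

```python
def f_from_summary(ns, means, sds):
    """One-way anova F from per-group N, mean and SD (N - 1 divisor assumed)."""
    total_n = sum(ns)
    k = len(ns)
    grand_mean = sum(n * m for n, m in zip(ns, means)) / total_n
    # Variation of column means around the grand mean.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
    # Pooled variation of individuals around their own column means.
    ss_within = sum((n - 1) * sd ** 2 for n, sd in zip(ns, sds))
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (total_n - k)
    return ms_between / ms_within, k - 1, total_n - k


# Figures as read from Table 13 (1, 12, 24 and 48 hours of deprivation).
ns = [28, 29, 30, 30]
means = [31.6, 62.0, 53.8, 45.6]
sds = [25.4, 35.0, 41.8, 18.2]

f_ratio, df_between, df_within = f_from_summary(ns, means, sds)
print(round(f_ratio, 2), df_between, df_within)
```

Working from summary figures keeps the arithmetic short, at the cost of
hiding any outlying cases that only the individual scores would show.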

Since Finan presents means and standard deviations by conditions
along with a dot graph representing individual performance, the stage is
set for the application of classical anova and the Range Test. Following
through on the anova steps and disregarding the potential heterogeneity
of variance across conditions yields a P-value of .009 that indicates,
by all ordinary standards, enough divergence from chance to warrant re-
jection of the Null hypothesis. The follow-up analysis by the t-test
supporting the generalization hypothesis concerning these data has al-
ready been mentioned. The next step, as always, is follow-up experimenta-
tion. A number of such studies (Jenkins, 1955) supports the conclusion
that drive change, like any other operational, experimental change, pro-
duces response decrement, except for the point already noted that sub-
stantial increases in drive lead to increases in gross bodily activity
that may be channeled into the recorded response so as to compensate for
the change effect.

It's obvious that the anova procedure applies to the two-group situ-
ation as well as to the situation involving three or more groups. There
are many occasions where it is profitable to pivot experimental findings
from one investigation on control data gathered in another experimental
setting using the principle of dual controls. In other cases one control
group may be the pivot point for several experimental groups. In all
instances, by definition, replication is involved. Some pertinent data
from an experiment on crowding the threshold with pigeons follow:

                                             "INTERNAL-
                            "INTERNAL"        EXTERNAL"
   S         CONTROL         CROWDING         CROWDING

   1          1970              230              580
   2          2300             1090             1040
   3          3800             2030             1470

In the "Internal" procedure, after conditioning at 80% of satiated
body weight, pigeons were completely satiated and their body weight was
then very gradually reduced to its original 80% level while exposure to
the Skinner Boxes was continued. In the combined case of "Internal-Ex-
ternal" Threshold Crowding, the same procedure was coupled with decreas-
ing the illumination on the pecking window to a minimum and then gradual-
ly reinstating the original illumination. The numbers represent extinc-
tion responses after the treatment.

First, it is obvious that independent organisms had to be used in
the three conditions and second, it is clear that matching could be em-
ployed (and was, but will be ignored in this context). In this apparent
anova set-up, the impact of the treatment was large, i.e., crowding the
threshold by either procedure cut extinction responding to less than half
of that of the control. Only one Experimental S's responding exceeded
the lower limit of the Control Ss. One might apply overall anova to these
numbers or the t-test to the separate experiments but it's obvious regard-
less of statistical outcome that behavioral change has occurred. In pass-
ing, it might be noted that only the matched t-test is applicable in the
pairing case as N is too small for either the Binomial or the Rank T-Test.

3. The Anova Range Test. Since Finan (1940) presented a dot graph
indicating individual responses, the range of performance in his four
groups can be estimated and is shown in Table 13. The range in the means
is a little over 30 responses, N is taken as 29, the mean of the ranges
in the samples is about 130 (despite a couple of outlying cases) and the
Range ratio value approaches 7.0 with a P-value of less than .01. Here
as in the other cases, the Range Test is far easier to apply than the
traditional tests and allows for considerably less computational error.

Table 14 presents some data from an auditory deletion experiment



Table 14

Number of items in a message correctly reconstructed by college
students as a function of percentage of the message deleted by
auditory masking. Maximum correct is 30.

                          PERCENT DELETED

   S          10%     20%     40%     50%     60%     70%

   1           27      21      18       8       7       7
   2           30      22      21       9      13       0
   3           28      25      20       8       8       6
   4           27      25      13       5       0       2
   5           27      25      15       6      11       0

 Mean         27.8    23.6    17.4     7.2     7.8     3.0

 Mean %
 Retrieved   92.7%   78.7%   58.0%   24.0%   26.0%   10.0%

that is particularly amenable to the Range Test. In this experiment
college students were given instructions to perform with a series of ob- 
jects placed in front of them. For separate groups, different propor- 
tions had been deleted by auditory masking. The score was the number 
correct of a possible 30 actions. The investigation had to do with the 
redundancy of the English language. 

Examination of Table 14 reveals a large, clearcut trend: the more
information deleted, the smaller the number of correct responses. There
is a "whopper" effect with the extreme groups differing by a factor of
five or more. One might apply classical anova to these results, but it
seems like a lot of work when the Range Test will quickly and easily do
the job. The range across means is roughly 25 units, N is 5 and the mean
range in the samples is about 6.5. The resulting Range value is around 20
with an associated P of considerably less than .01. Extremely high signifi-
cance is demanded by the inspectional fact that adjacent distributions
overlap only slightly except for the 50% and 60% conditions. Inspectional
analysis pinned down by graphical representation would seem sufficient
analysis for these clearcut findings.
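
The arithmetic quoted for the Range Test can be retraced as follows. This
is only a sketch: the ratio used here, N times the range of the means
divided by the mean within-sample range, is inferred from the worked
figures in this section rather than taken from a stated formula, the
function name is hypothetical, and the scores are those read from Table
14. The P-value itself has to come from the Range Test tables, which are
not reproduced here.

```python
def range_ratio(columns):
    """N x (range of column means) / (mean within-column range); inferred form."""
    n = len(columns[0])  # equal N per column is assumed
    means = [sum(c) / len(c) for c in columns]
    range_of_means = max(means) - min(means)
    mean_range = sum(max(c) - min(c) for c in columns) / len(columns)
    return range_of_means, mean_range, n * range_of_means / mean_range


# Items correct by percent deleted, as read from Table 14.
table_14 = [
    [27, 30, 28, 27, 27],  # 10%
    [21, 22, 25, 25, 25],  # 20%
    [18, 21, 20, 13, 15],  # 40%
    [8, 9, 8, 5, 6],       # 50%
    [7, 13, 8, 0, 11],     # 60%
    [7, 0, 6, 2, 0],       # 70%
]

range_of_means, mean_range, ratio = range_ratio(table_14)
print(round(range_of_means, 1), round(mean_range, 1), round(ratio, 1))
```

Only maxima, minima and means enter the computation, which is what makes
the procedure quick and leaves little room for arithmetical slips.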

A comparison of visual and auditory deletion may generalize the case.
Whereas in visual deletion of letters in printed material (Jenkins and
Mosteller, 1954) with 50% deleted, nearly 90% of the message was correct-
ly reconstructed, here with 50% masked by auditory stimulation less than
one-quarter of the message was correctly retrieved. A level comparable to
that of visual deletion was found here with only 10% of the message de-
stroyed. The discrepancy is consistent with the view that man is primarily
a visual organism.

Cases could be multiplied ad nauseam illustrating single classifi-
cation anova, but the point is clear that there are better ways than the
traditional ones. The Range procedure is most appropriate so long as
one looks for especially outlying cases. Transformation of the data to
ranks helps and there seems to be no reason that the range procedure can't
be applied to the ranks directly rather than wading through the cumber-
some arithmetic of the rank technique. For example, when the gross bodily
movement data of Table 12 are transformed to ranks and the Range Test
applied to the ranks, a value near the .05 level emerges. In all cases,
of course, the more formal analysis should support the trends visible in
the data.

After anova, what? Multiple comparisons. As has been noted several
times, the outcome of anova can indicate overall significance, but not
pinpoint what particular, specific differences are generating this outcome.
The essence of demonstrating what a significant anova adds up to lies in
teasing apart the means associated with the several conditions. This can
clearly be accomplished by inspection of the means and variabilities, but
most behavioral scientists require more quantitative evidence. A number
of procedures are available (Ryan, 1959), and as such will be noted but
not treated. Tukey's Layer Test is one of the better ones where outlying
means are peeled off like layers of an onion. The t-test is sometimes
used incorrectly. It was developed for testing the hypothesis of zero
difference between two and only two means. The distortion introduced when,
say, six means are compared and contrasted is apparently large unless a
directional hypothesis was set up on an a priori basis, i.e., mean A pre-
dicted greater than B, B greater than C, and so forth. In this instance,
however, Mosteller's Testing a Ranking (1950) procedure is far more ef-
ficient than other so-called multiple comparisons since, if the means at-
tain the predicted order, the operating hypothesis is accepted without
further ado. Nothing could be simpler. The rub comes, however, when
the prediction is made and the predicted order of means is not achieved
within the limits of sampling variation. Then considerable experimental
effort has been wasted. In other words, a large wager is made for a big
return, but a loss is also big.

The basic problem in contrasting more than two means in an anova 
set-up is obtaining an overall estimate of error for any mean that re- 
flects the expected (and obtained) sampling variation in all of the means. 
Once this parameter is fixed (and one outlying case can create real prob- 
lems), the procedure is simply one of setting a significance level and 
determining if adjacent means - arranged in order of magnitude - differ 
enough to infer separate target populations. The arithmetic is a little 
lengthy, but the basic notion is straightforward.

Across the board - and this comment applies to forthcoming sections
as well as the present one - anova is a handy exploratory instrument where
one is not certain what's going on with the numerical patient. It helps
one infer overall significance, and, as such, is a systematic operation,
but is no substitute for more precise or sensitive analytical tools. It
is clearly no replacement for inspection since, as has been already noted,
significance can accrue to anova when the relationship is linear, exponential
or parabolic - and behaviorally it usually makes a great deal of differ-
ence which it is. Anova, however, doesn't respond to the nature of
functional relationships. One other minor objection to anova might be
noted. It does not indicate (any more than the t-test or similar measures)
the intensity of relationship (far less the direction) involved. Peters
and Van Voorhis (1940) (and others since) have proposed a generalized form
of curvilinear correlation, Epsilon Squared, as a substitute for anova on
the ground that it provides an index of relationship. There is clearly
a point here, but the same objections of effort and error apply to this
procedure as to anova. There is no substitute for visual scanning and
graphical representation as the basic modes of determining the effects
of an experimental treatment on behavior.

MAGNITUDE : TWO " SIMULTANEOUS " DIMENSIONS OF 
EXPERIMENTAL VARIATION , DOUBLE CLASSIFICATION ANOVA 

This complicated phrase encompasses three related but disparate sit-
uations:

1. Three or more conditions of the experimental treatment with the 
same Ss rotated through the conditions (self-control procedure) or the 
use of Ss matched on some a priori basis; 

2. The case of "repeated measurements" or "trend analysis" where 
two independent groups are tested or measured several times over a series 
of trials or blocks of time; 

3. "Simple" factorial design where two experimental treatments are 
applied "simultaneously" to two or more groups each. 

It might be noted, ab initio, that such matters as two "simultaneous"
dimensions of variation, be they a correlational element and a treatment
or two treatments, are complex matters from the standpoint of both statis-
tics and arithmetic. From a design and experimenting view, they add only
slight to moderate additional a priori and experimenting labor and may
pay large dividends. It would seem that as design increases a bit in com-
plexity statistics increase geometrically in difficulty. It might be added
at this juncture that additional increases in design complexity, such as
adding a third experimental treatment, also seem to increase the diffi-
culty of interpreting the resulting data geometrically. In addition this
sets the stage for a major role to be played by one deviant case going
against the grain of the group. More will be said on these matters in
connection with complex anova. The point to keep in mind is that both
statistical and interpretative effort increase greatly as treatments or
variables are added.

1. Correlated data. This situation is a variant on the single class-
ification anova theme where the variable added is a correlation across
rows, obtained either by using the same Ss rotated through the three or
more conditions or by matching Ss on a beforehand basis and assigning
them in trios or larger sets to the several conditions.

As a tour de force in another connection (Jenkins, 1966) I wrote up
the following (hypothetical) example of translating everyday business into 
experimental action. 

The Whiff Test. This example stems from the hypnotic state induced
by overexposure to TV ads. This attack on the deodorant problem is in-
tended as a rough and ready paradigm for experimental designs dealing with
a comparison of advertised products. The steps spelled out apply equal-
ly to detergents, soap, hair tonic, cars, cigarettes, toothpaste, razor
blades, dog food, and the like. It might be noted in passing that these
problems are far from trivial in at least one sense: problem significance
is met in that the very practical criterion of billions of dollars per
year is involved.

The first consideration is, of course, the experimental treatment. 
This is straightforward. The three deodorants leading in sales are se- 
lected for experimental examination. This is an objective and satisfactory 
criterion for inclusion. Advertising claims as to effectiveness can be 
ignored since they all amount to the same thing: vague and meaningless 
come-on. The several deodorants are to be applied in equal amounts (or
durations) or this property is to be varied systematically as part of the 
experimental treatment. Also built into the design at this point would 
be variation in time since bathing and nature of activity preceding appli- 
cation, e.g., social, physical or intellectual. 

The core of the design would be to rotate a small sample of Ss, say
10, through all orders of presentation of the deodorants (including a
"placebo" and a "nothing" baseline condition) several times, applying a
test for odor (The Whiff Test) each time, these steps all followed by a
replication with 10 more Ss. Subjects should be roughly representative
of the target population of deodorant users in age, sex, frequency of use,
shaving of axillae, etc. A sub-sample of non-users might add interest-
ing information.

The dependent variable or behavioral measure is slightly more com-
plicated. While a refined instrument such as Zwaardemaker's Olfactometer
could be used to measure odor as a supplement to the proposed test, the
latter is simpler and requires no more than the human apparatus. The
Whiff Test consists of having three judges without head colds, nasal ob-
struction or other olfactory difficulties, approach S and sniff (or whiff)
at systematic distances from him. Each judge would independently record
"yes" or "no" for the presence or absence of odor. Any special features
such as intensity or quality of odor would also be noted. Adaptation ef-
fects for the judges should be controlled by interpolating periods of nasal
inactivity. It is obviously preferable that S not know he is being judged.
Information regarding the chemical nature of the deodorants and the amount
of perspiration generated by Ss under various conditions is of interest,
but not the focal point of the investigation.

Control procedures have already been stipulated for a number of
sources of variation. By the self-control design, individual variations
are minimized and sensitivity to the treatment maximized. The use of both
a "placebo" and a "nothing" condition provides a baseline below which the
suppressive effects of the deodorants can be assessed. Mode of presenta-
tion, e.g., stick or spray will, of course, be held constant or varied
systematically. Other considerations may include training the judges in
olfactory discrimination and control of the odor of the deodorants them-
selves.

Since the culture seems to imbue large numbers of people with re-
serve - if not fear and anxiety - about numbers and, particularly, about
statistical manipulation of them, it seems appropriate to demonstrate the
potentially simple nature of analysis for the types of numbers emerging
from this investigation. Numbers, after all, are simple and crisp; only
people make them complicated. In any event, the following table presents
a hypothetical listing of combined frequency of judges' "yeses". It
should be noted that the magnitude of the entries would be much greater
in actual experimental practice.



 SUBJECT          DEODORANT

              A       B       C

    1         3       2       1
    2         2       2       0
    3         5       2       1
    4         3       2       1
    5         3       3       2
    6         2       3       1
    7         3       2       0
    8         3       3       2
    9         3       3       2
   10         1       2       0



First of all, the usual individual variations occur, but the main
point is the consistently higher values for products A and B over C.
(Note that "averages" are not needed and not presented.) In each compar-
ison (A-C and B-C), perfect consistency is achieved in this hypothetical
case. Ten out of ten events by the binomial yields a P of less than 1 in
1000. A comparison of A with B yields roughly a 50-50 split. Thus, pro-
duct C is the "effective" deodorant of the three, remembering that the
numbers represent the frequency of "can smells" by the judges.
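
The binomial figure can be checked with a short sketch. The function
names are hypothetical and the entries are those of the hypothetical
table above; ties are set aside before counting, and the probability is
the one-tailed chance of a split at least this lopsided when either
direction is equally likely.

```python
from math import comb

# (A, B, C) "yes" counts per subject, as read from the hypothetical table.
WHIFF = [(3, 2, 1), (2, 2, 0), (5, 2, 1), (3, 2, 1), (3, 3, 2),
         (2, 3, 1), (3, 2, 0), (3, 3, 2), (3, 3, 2), (1, 2, 0)]


def sign_counts(pairs):
    """Count pairs where first > second and first < second; ties dropped."""
    plus = sum(1 for x, y in pairs if x > y)
    minus = sum(1 for x, y in pairs if x < y)
    return plus, minus


def binomial_tail(plus, minus):
    """One-tailed P of a split at least as lopsided as observed, at p = .5."""
    n, k = plus + minus, max(plus, minus)
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


a_vs_c = sign_counts([(a, c) for a, b, c in WHIFF])
b_vs_c = sign_counts([(b, c) for a, b, c in WHIFF])
a_vs_b = sign_counts([(a, b) for a, b, c in WHIFF])
print(a_vs_c, binomial_tail(*a_vs_c))
print(b_vs_c, binomial_tail(*b_vs_c))
print(a_vs_b)
```

Counting directions rather than averaging is what lets the analysis be
done by eye; the computation merely attaches a probability to the count.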

The classical anova procedure for correlated, self-control data such
as these adds one arithmetical manipulation. Besides considering and com-
puting variation across columns (treatment effects), the correlation is
taken into account by dealing with variations across rows. If the cor-
relation is substantial, this variation will be large, and when partial-
led out of the error variance, will leave the latter small thus enhanc-
ing the significance level. If the correlation is less than substantial,
the investigator may have wasted his time in matching and in computing
the correlational variation. It would be wise to inspect the data first.

In the case of the Whiff Test, the numbers are small and the spread
so restricted that it hardly seems sensible to talk about correlation. 
Thus classical double classification anova for correlated data hardly 
seems applicable. The self-control design, however, paid off in that the 
simple binomial procedure allowed for rapid support of the inspectional 
analysis, namely, product C separated off for the judges from A and B. 

To stamp in the point about correlated anova, there follow some
data from a drive experiment where four pigeons were exposed to aperiodi-
cally reinforced responding at three different percentages of satiated
body weight. The precautionary controls were, of course, exercised of
using different orders of presentation of drive levels for each bird,
measuring several times at each level, stabilizing body weight before
measurement and so forth. The numbers represent responses in 30 minutes
divided by 100 and rounded for simplification.



DRIVE LEVEL 



S 





95% 



17 
15 
2h 



15 

k 



10 

6 
11 

2 



o 



17 

6 

Several items are immediately obvious in this table. Across the
wide range of responding represented, all birds show a diminution in
frequency of response as drive is decreased. There is only one small re-
versal and a suggestion of approach to an asymptote appears. Many things
could be done to these data statistically; little need be. Considerable
correlation emerges in the data: birds starting high, stay high and vice
versa. Double classification anova, teasing out the effects of drive
(columns), self-control or correlation (rows) and error (remainder), yields
significance supporting the obvious nature of the numbers.

The computational steps for the traditional double classification
anova for correlated data are presented in detail in standard textbooks
and need not be spelled out here. It seems worthwhile, however, to refer
back to the rank procedure for the correlated data set-up that was pre-
sented in detail in the original Q & D manuscript. The Friedman Rank
Anova is quite straightforward. Table 15 presents some data appropriate
to it from an experiment on Thorndike's "spread of effect" but without
reward or learning (Sheffield, 1949; Sheffield and Jenkins, 1952). Col-
lege students simply wrote down several hundred numbers from 1 to 10;
"chance" repetitions were lined up on the answer sheets and the percentage
of repetition following these chance repeats was calculated.

In the Friedman procedure, the ranking takes place S by S across
rows. The ranks are then summed by columns and substituted in a formula
that yields a Chi Square-like number. The question being asked is
whether the sums of ranks by columns pull far enough apart to warrant re-
jection of the Null hypothesis where the correlation (and the design



Table 15

Percent repetition in a "spread of effect" set-up without reward
or learning.

           CHANCE        POSITION AFTER CHANCE REPEAT
   S       REPEATS         1          2          3

   1         299          26.1       11.4       11.7
   2         303          17.5       15.3       13.1
   3         276          20.3       10.2       10.6
   4         289          19.4       11.7       10.0
   5         318          20.4       13.5       11.9
   6         318          24.5       13.2       13.2



matching) is considered by ranking across rows. It should be obvious
that if every S is, say, highest under a particular condition, the sum
of ranks for that condition will diverge from the others. Again, out-
lying scores are corrected for, at least in part.

Perusal of Table 15 indicates a clear sloughing off of Position 1
from behavior at the other two positions. (Note that chance in writing
down the numbers 1 to 10 is 10% and that behavior at Position 1 exceeds
this value by roughly a factor of two.) In all six cases percent repeti-
tion is higher at Position 1 than at either Positions 2 or 3 from a chance
repeat. The binomial gives a probability of 1/64 for these two sets of
events. The rank analysis of variance for correlated data yields a re-
sult consistent with the binomial scanning analysis, namely, a Chi Square
of 9.1 and a P-value of .01.
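
As a check on the figures just quoted, the Friedman steps can be sketched
as follows. The function names are hypothetical, the percentages are those
read from Table 15, tied values within a row share the mean rank, and the
probability rests on the Chi Square approximation with two degrees of
freedom.

```python
from math import exp

# Percent repetition at Positions 1, 2 and 3, one row per S (Table 15).
TABLE_15 = [
    (26.1, 11.4, 11.7),
    (17.5, 15.3, 13.1),
    (20.3, 10.2, 10.6),
    (19.4, 11.7, 10.0),
    (20.4, 13.5, 11.9),
    (24.5, 13.2, 13.2),
]


def row_ranks(row):
    """Rank one S's scores from low to high; ties share the mean rank."""
    return [sum(1 for y in row if y < x)
            + (sum(1 for y in row if y == x) + 1) / 2
            for x in row]


def friedman(rows):
    """Return (rank sums by column, Friedman Chi Square-like value)."""
    n, k = len(rows), len(rows[0])
    ranked = [row_ranks(r) for r in rows]
    col_sums = [sum(r[j] for r in ranked) for j in range(k)]
    chi_sq = (12.0 / (n * k * (k + 1)) * sum(s * s for s in col_sums)
              - 3 * n * (k + 1))
    return col_sums, chi_sq


col_sums, chi_sq = friedman(TABLE_15)
p_approx = exp(-chi_sq / 2)  # upper tail of Chi Square with 2 df
print(col_sums, round(chi_sq, 1), round(p_approx, 2))
```

Because the ranking is done row by row, differences between Ss in overall
level drop out, which is the rank counterpart of partialling out the
correlation across rows.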

Over and beyond any manipulation of the numbers, the important find-
ing in this experiment is the occurrence of the "spread-of-effect" pheno-
menon in a setting where neither reward nor learning was operating. Since
Thorndike labelled his original paper on the "spread-of-effect", "A proof
of the law of effect", data such as these "disprove" his proof and cast
deep doubt on the formulation of the law of effect. As usual, a far
simpler contiguity principle was operating to generate the findings, name-
ly, the number guessing habit sequences that Ss bring to the experimental
situation so that when one number is anchored (in this instance on a chance
basis), the several numbers associated with it in sequence follow. Evi-
dence against the Law of Effect has been accumulating since before its
inception. This type of result adds to the pile.

Classical anova could be applied to these data, but it hardly
seems worth the effort in the light of the outcomes of the easier ana- 
lyses. It must yield a significant result in view of the orderliness of 
the findings. 

2. Repeated Measurements. It is a very common occurrence in be-
havioral research for the reactions of an organism to be recorded over
a series of trials or in several blocks of time. For instance, the ex-
tinction curve deriving from the behavior of a rat in a Skinner box may
well be divided into time portions. Or the latency or running time of a
rat in a runway may be plotted trial-by-trial. Typically, in these sit-
uations an experimental treatment is applied to one or more groups and
a control treatment to others with repeated measurements being taken for
both groups. We are interested in the action of our experimental treat-
ment, changes in behavior over time or trials and the interaction of the
two, that is, systematic, differential changes in one group as contrasted
with the other as time or trials go on. Certain experimental operations
may contribute to the retardation or facilitation of acquisition or ex-
tinction. The effects emerge as we contrast an experimental with a con-
trol group over a series of trials or blocks of time. In extinction, for
instance, a given procedure may result in retardation of the last half
of extinction with little or no impact on behavior in the first half of
extinction. This section is concerned with these trend matters. It
might be noted that the "repeated measurement" set-up is essentially an
extension of double classification anova. The same Ss are repeatedly
tested, some under one set of conditions and other Ss under other condi-
tions, so that a correlational component is involved within each of the
conditions.

At this juncture the reader should again be cautioned that the cri- 
terion for assessing data is adamant: If one can't see the effect in the 
numbers, it's very likely not there. 

To start with a complex example and work back to the simple, Table
16 contains some results from a generalization-drive experiment with
pigeons (Jenkins et al, 1958). After stabilization of responding on an
APR schedule with one group at 90% of satiated body weight and the other
at 70%, the size of the illuminated spot on the pecking window was varied
systematically during brief extinction-generalization tests. These were
repeated a number of times. Stabilized responding was used as the
baseline to convert test responses to percentages to cut back on
individual variability. The bird-by-bird data are contained in Table 16.

One could whip this series of numbers to a pulp statistically and
squeeze nothing more from them than meets the naked eye. First things
first: the individual generalization functions of the two birds nearest
the median of their respective groups in training are plotted graphically
in Fig. 1. From these two representations and without recourse to any
statistical manipulation, it is obvious that several differential
behavioral events have occurred. First, except for a couple of minor
reversals all birds exhibited consistent generalization decrement
functions: as stimulus dissimilarity from the standard increased,
responding decreased - the usual finding in this setting. Next, drive had
an appreciable effect on responding with appreciably higher percentages
appearing for the high-drive (HD, 70%) group than for the low-drive
(LD, 90%) condition. Again, this is a common finding that, in an already
conditioned piece of behavior, greater deprivation generates increased
responding. The third, and most important effect, is the interaction of
drive and stimulus change. Interaction means, of course, simply that
behavioral changes associated with one experimental dimension of
variation vary differentially with the application of some other
treatment. Behavior changes as a joint function of the two dimensions,
say, experimental and control. Exactly that happened here. As stimulus
dissimilarity increases, the two generalization decrement functions pull
apart with the HD group showing a flattening out and much less decrement
while the LD continues to drop off with increased dissimilarity.

Generalization as a function of drive level in percentage terms.
(Jenkins et al, 1958)

Across the board, careful study of Table 16 and Fig. 1 clearly
supports these inferences. Journal editors, however, require more elegant
statistical manipulation. If these are applied, the three sources of
behavioral variation turn out significant: drive, spot size and the
interaction of the two. Such elaborate trend procedures may satisfy
editors and those who are compulsive about their statistical analysis,
but they can be frustrating and time-consuming for the behavioral
scientist who can see the effects clearly in the data and wants to get on
about his experimental business. However, this presentation is a dialogue
not a diatribe.

A nice example of an apparent contradiction between simple,
inspectional-type statistics and a more elaborate, complicated procedure
derives from the data of Table 17. The numbers are extinction responses
for two independent groups of pigeons conditioned on a 100%
reinforcement schedule with the major treatment being distribution of
practice. One group was conditioned and extinguished in a series of brief
sessions while the other was exposed to continuous conditioning and
extinction without a break. The theoretical reasoning behind this
experiment is quite straightforward. One position regarding extinction is
that it constitutes a passive, decay process. The contiguity view, on the
other hand, holds that extinction is a form of learning where, under
conditions of radical stimulus change (particularly after 100%
reinforcement), a new habit is acquired in the presence of a major
portion of the original stimulus compound. In the case of pigeons
operating in a Skinner Box, the new habit consists of doing something
(usually not recorded) other than pecking the illuminated window. Since
distribution of practice facilitates learning, and if extinction is
learning, the latter should be speeded up by distributing extinction
trials or sessions. Thus a crisp counter-opposing test of the two views
of extinction is provided by this type of experiment. (Virginia Sheffield
(1950) performed the classical investigation in this area.)

Table 17

Pigeons' extinction responses in 20 min. periods with massed and
distributed extinction.

                          20 MINUTE PERIODS

                  S        1        2        3     Total

                  1       48       35       16       99
                  2       27        6       13       46
                  3       92       29       10      131
                  4      171       31        9      211
DISTRIBUTED       5       57       43       18      118
                  6       56       17        4       77
                  7       67       24        7       98
                  8       18        6        0       24
                  9        4       10        0       14
                 10       44       38       22      104

               Mean     58.4     23.9      9.9     92.2
             Median     52.0     26.5      9.5     98.5

                  1       14        8       40       62
                  2      105       23       38      166
                  3       71       92        9      172
                  4      131      123       28      282
MASSED            5       31        0       34       65
                  6       46        1       26       73
                  7       49        5       23       77
                  8      105       48       62      215
                  9       54        5       23       82
                 10      110       16       22      148

               Mean     71.6     32.1     30.5    134.2
             Median     62.5     12.0     27.0    115.0

The numbers contained in Table 17 are a little complicated, but the
trends are clear from inspection. All 10 birds in each condition show
the decrement in behavior associated with extinction operations. There
is considerable intra- and inter-group variability, but the data suggest
a clear trend in the direction of the hypothesis of more rapid extinction
for the group treated with distributed extinction. Or in other words,
this group shows faster acquisition of some habit other than pecking the
window - whatever it may be. Furthermore, it seems as if the groups
start quite close together early in extinction and pull apart in the last
20 minute period. As a matter of fact there is only one case in the
massed group whose responses get onto the distributed distribution in the
last 20 minutes of extinction. (It is noteworthy that 8 of the 10 massed
extinction birds increase responding from the second to the third 20
minute period.) From these not so casual inspections, it would then
appear that a case may be made for the significant action of 1)
distribution of extinction practice, 2) extinction sessions and 3)
interaction between the two with the functions pulling apart over
sessions.

Classical statistics do not agree with these interpretations. The 
traditional trend analysis for repeated measurements shows that only ex- 
tinction per se is significant, a point that is obvious in that all 20 
birds showed decremental effects over sessions. 

These contradictions need to be resolved. If one accepts the
conclusions available from the classical analysis, a good deal of
information is overlooked and the findings are equivocal with regard to
the hypothesis entertained at the outset. It seems wasteful to follow
this procedure and disregard some striking trends in the data. As an
initial probe, the overall repeated measurement analysis may be useful,
but it appears quite insensitive to the actual behavioral changes
occurring. Thus we must resort to other techniques if we are to salvage a
test of the hypothesis - and this seems a highly worthwhile step. Several
things may be done to the data. For one thing, conversion of the raw
numbers to ranks helps a little in cutting back on the appreciable
variability, but even under these conditions, usual statistical
significance is not achieved for the basic treatment variable of
distribution of extinction practice. Another way to go is simply to
analyze the data by fragments - even in the teeth of the objections that
can be raised to piecemeal statistical treatment. For instance, a
classical t-test applied to the extinction responses in the last 20
minute period yields a highly significant P-value, as it must from the
almost non-overlapping nature of the distributions. But this procedure
still leaves the situation somewhat open-ended. It could be argued, for
example, that the distributed group (for whatever (chance) reason)
started lower (but not significantly) in performance in extinction and
ended up lower simply because of the built-in behavioral correlation.
This is a possibility that must be considered. The obvious procedure is
to convert extinction responses in the last 20 minutes to percentages,
bird-by-bird, of the first 20 minutes of extinction. The median decrement
in the distributed group was about 90% while in the massed group, it was
only 49%. The corresponding means were 80% and 20%. The variability in
the percentages was quite large so that a Mann-Whitney-Wilcoxon Rank
T-test was employed. It yielded a P-value of .01. Splitting the
percentages on the grand mean, the P-value associated with the
Fisher-Yates Exact Test was .00? with a Phi Coefficient of .50.
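
The bird-by-bird conversion and the rank comparison just described can be
sketched in a few lines. This is a minimal illustration, not the original
computation: the first- and last-period counts are taken from Table 17 with
garbled entries resolved from the row totals (an assumption), the names
percent_decrement and rank_u are made up for the example, and the figures it
prints need not reproduce the rounded values quoted in the text.

```python
from statistics import mean, median

# (first 20 min, last 20 min) responses per bird, as read from Table 17.
# Entries that were illegible were resolved from the row totals (assumption).
distributed = [(48, 16), (27, 13), (92, 10), (171, 9), (57, 18),
               (56, 4), (67, 7), (18, 0), (4, 0), (44, 22)]
massed = [(14, 40), (105, 38), (71, 9), (131, 28), (31, 34),
          (46, 26), (49, 23), (105, 62), (54, 23), (110, 22)]


def percent_decrement(pairs):
    """Bird-by-bird drop from the first to the last period, in percent."""
    return [100.0 * (first - last) / first for first, last in pairs]


def rank_u(low, high):
    """Count (x, y) pairs with x from `low` below y from `high`.

    Ties count one half. This is the U statistic of the rank test,
    computed by direct pair counting.
    """
    return sum(1.0 if x < y else 0.5 if x == y else 0.0
               for x in low for y in high)


dist_dec = percent_decrement(distributed)
mass_dec = percent_decrement(massed)

# How often does a massed bird show less decrement than a distributed bird?
u = rank_u(mass_dec, dist_dec)
pairs = len(mass_dec) * len(dist_dec)

print("median decrement:", round(median(dist_dec), 1), round(median(mass_dec), 1))
print("mean decrement:  ", round(mean(dist_dec), 1), round(mean(mass_dec), 1))
print("U =", u, "out of", pairs, "pairs")
```

The closer U lies to the total number of pairs, the less the two sets of
percentages overlap; a table or program for the rank test would then turn U
into a P-value.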

All these additional rather detailed analyses in support of
inspection prove out what one sees in the data, namely, a large and
significant difference in the third 20 minutes of extinction with the
distributed birds losing the old behavior and acquiring the new more
rapidly than the birds exposed to the massed extinction procedure. Since
differences were small and insignificant in the first 20 minutes of
extinction, this significant finding clearly indicates a pulling apart of
the behaviors as a joint function of distribution of practice and
extinction sessions, i.e. suggests a clearcut interaction effect. It
hardly seems necessary to go farther with statistics in this instance. It
does, however, provide a nice example of a basic caution: caveat emptor
when the gifts are those of traditional statistical analysis applied to
behavioral data with all its vagaries and explications. Do not buy a
statistical pig in a poke.

To end this section with a relatively simple example, Table 18 was
constructed. It consists of extinction responses - coded, rounded and
simplified - from an experiment on reinforcement theory in which a brief
flash of light was thrown on the pecking window during pigeons'
extinction as a substitute for the presentation of food during prior
aperiodically reinforced responding. This increase in stimulation (as
well as change) should serve as a reinforcing agent by the contiguity
position and contrary to the drive-reduction view. Its reinforcing
property lies in its ability to change behavior, to bring about momentary
pauses as changes in the pigeons' behavior and thereby maintain the
behavior above the extinction level of a control group without this
light-up treatment.

Table 18

Pigeon extinction responses coded and rounded, with and without
brief periods of increased illumination substituted for food.

                  S     1st Half      2nd Half

                  1         8             5
Light-Up          2    [illegible]        4
                  3         6             3

                  1         8             2
No Light-Up       2         7             1
                  3    [illegible]   [illegible]

The upshot of Table 18 is straightforward. Behavior was maintained
by the light-up although decremental extinction effects appeared in the
behavior of both groups. The behavior started at about the same level in
the first half of extinction and pulled apart in the second half. Six out
of six birds showed decrement in behavior (P of .016 by the binomial) and
the three light-up birds exceeded the three controls in the second half
of extinction (P of .05 by the Arrangement Technique). It follows that,
with two groups starting at the same point and ending up at different
points with significance accruing to time and treatment differences,
interaction between the two dimensions is also significant. The proof
lies not in the classical statistical pudding, but it may be employed as
a supplementary procedure. All conclusions from the "rough and ready"
analyses are supported by the more traditional approach: significance
emerges for the three sources of experimental treatment, extinction over
sessions and the interaction of the two. It helps when the classical, far
more cumbersome procedure generates results consistent with the quick and
dirty ones. Inspection again pays off.
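
The two small probabilities quoted for Table 18 can be checked by hand or
with a short sketch such as the one below. It rests on an assumption about
what the two procedures amount to: the binomial figure is taken as the
chance that all n birds change in the same stated direction when either
direction is equally likely, and the Arrangement Technique figure as the
chance that one stated group of scores lies wholly above the other when
every ordering is equally likely. The function names are hypothetical.

```python
from math import comb


def all_same_direction_p(n):
    """Chance that n of n cases change in one stated direction (p = 1/2 each)."""
    return 0.5 ** n


def complete_separation_p(k, m):
    """Chance that k stated scores all exceed the other m scores.

    Of the comb(k + m, k) ways to pick which k ranks are the top group,
    exactly one puts all k above all m.
    """
    return 1 / comb(k + m, k)


# Six of six birds showing decrement.
print(round(all_same_direction_p(6), 3))

# Three light-up birds exceeding three controls.
print(complete_separation_p(3, 3))
```

With six birds the first value rounds to .016, and with three scores against
three the second is 1 in 20, which agrees with the figures cited for Table
18.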

3. "Simple" Factorial Design. "Simple" factorial design is fairly
straightforward; "simple" factorial analysis of the outcome of the de- 
sign is far from simple. In the former instance, factorial design in- 
volves the "simultaneous" application of two dimensions of experimental 
variation. It is clearly not "simultaneous" because independent groups 
of Ss are involved. In the simplest case there are two experimental treat- 
ments with just two conditions each, making up a two-by-two layout of four 
cells in all. The generalized case or prototype is this: 



                                   VARIABLE A

                          CONDITION 1     CONDITION 2

             CONDITION 1
VARIABLE B
             CONDITION 2

Thus the four independent groups constituting the cells of this
set-up are: A1, A2, B1 and B2. How many cases are treated in each cell
and other considerations are a function of the nature of the problem, the
experimental treatments, the behavioral measurements involved, and the
like. Analytically, the procedure consists of teasing out the effects of
Variable A separately, those of Variable B by itself without regard to A
and finally the joint action or interaction of the two dimensions of
variation simultaneously, namely, the diagonal cells A1B1 plus A2B2
versus A1B2 plus A2B1. Probably the simplest paradigm for remembering the
factorial layout is a stimulus change or generalization experiment where
one dimension is degree of stimulus dissimilarity from the originally
conditioned stimulus and the other experimental treatment constitutes
degrees of drive, partial reinforcement, distribution of practice and so
forth. The upper left hand cell combined with the lower right hand cell
constitute the conditions of no stimulus change; the other diagonal cells
are the ones treated with change. This point will be spelled out below.

This is as good a place as any to pinpoint the nature of the
interaction source of variation in behavior particularly and in
statistics secondarily. We will consider only the simple interaction case
where there are two treatments and a single interaction. More complex
interactions of three or more variables may be comprehensible to
sophisticated mathematicians in a statistical sense, but their behavioral
meaning appears to rapidly fade away.

"Interaction" is relatively easy to describe in behavioral terms. The
question is: do the responses change differentially with the application
of both variables; are they a joint function of the two treatments? Put
more simply, is behavior different under one set of experimental values
than under another? Does behavioral change hinge on the combined action
of the two experimental variations? Examples may help clarify the matter.

There are four major combinations of events with two treatments. 
(There are several others, but they are minor for our purposes.) They are: 

A. Where only one of the experimental treatments has an impact on
behavior;

B. Where both treatments influence behavior on a large scale
leaving little behavioral variation left over for interaction effects;

C. Where interaction accounts for most of the behavioral variance
with little remainder for the two experimental treatments;

D. And where all three primary sources of variation - the two basic
variables and the interaction - have a big impact on behavior.

Each of these cases will be considered in turn. 
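
The four patterns listed above can be sketched with made-up cell totals
before turning to the individual cases. Everything in this example is
hypothetical: the numbers are invented to show each pattern and are not the
values of the charts, and the function name margins is an assumption. The
point is only that inspection compares three pairs of sums, the two A
margins, the two B margins and the two diagonals (A1B1 plus A2B2 versus A1B2
plus A2B1).

```python
def margins(cells):
    """Return (A margins, B margins, diagonal sums) for a two-by-two layout.

    `cells` maps (a, b) to the cell total, with a and b each 1 or 2.
    """
    a = tuple(cells[(i, 1)] + cells[(i, 2)] for i in (1, 2))
    b = tuple(cells[(1, j)] + cells[(2, j)] for j in (1, 2))
    diagonals = (cells[(1, 1)] + cells[(2, 2)],
                 cells[(1, 2)] + cells[(2, 1)])
    return a, b, diagonals


# Hypothetical cell totals, one layout per case; keys are (A level, B level).
cases = {
    "A": {(1, 1): 10, (2, 1): 25, (1, 2): 10, (2, 2): 25},  # only A matters
    "B": {(1, 1): 10, (2, 1): 20, (1, 2): 30, (2, 2): 40},  # A and B, no interaction
    "C": {(1, 1): 30, (2, 1): 10, (1, 2): 10, (2, 2): 30},  # interaction only
    "D": {(1, 1): 30, (2, 1): 30, (1, 2): 30, (2, 2): 10},  # all three sources
}

for name, cells in cases.items():
    a, b, diagonals = margins(cells)
    print(f"Case {name}: A margins {a}, B margins {b}, diagonals {diagonals}")
```

A pair of sums that differs points to the corresponding source of variation;
whether the difference is large enough to matter is then judged against the
spread of the individual scores within the cells.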

CASE A: The Operation of One Variable. The accompanying chart shows
in numbers and graphically what happens in the hypothetical case where one
variable influences behavior in the two-by-two set up and little impact is
exerted by the other variable or by the joint action of the two variables
(interaction). Here as values of Variable A increase, behavior increases
regardless of whether condition B1 or B2 is involved. The two functions,
so to speak, go up together. The marginal sums in the tabular material are
the key to inspectional analysis. These reveal an increase by more than a
factor of two as we go from Condition A1 to Condition A2. No difference
emerges between Conditions B1 and B2 and little between the diagonal cells
(A1B1 plus A2B2 versus A2B1 plus A1B2). Analysis of the data by a
standard t-test would reveal a highly significant difference between
Conditions A1 and A2 without regard to variable B. As a matter of fact the
A1 and A2 distributions do not overlap. The more elaborate interactive
analysis reveals essentially the same outcome. Significance accrues to
Variable A but to neither Variable B nor the interaction of Variables A
and B.

[Chart. CASE A: THE OPERATION OF ONE VARIABLE]

This kind of finding might emerge from a "perceptual" experiment in
which "Levelers" and "Sharpeners", perceptually defined, were selected to
constitute Variable B and Variable A consisted of success or failure in
learning or problem solving. The success-failure dimension influences
behavior, but not the perceptual variable.

CASE B: The Operation of Both Variables. The chart presents the
data for the case where both variables have a large and significant impact
on behavior to such an extent that little behavioral variation is left
over for the interaction of the two dimensions of variation. Again, the
effects are clear in the marginal totals with the situation rigged so that
the diagonal cells end up with the same sums. From these marginals it is
also apparent that Variable B has a larger behavioral effect than Variable
A, but that both operate on an appreciable scale. Instances where this
kind of finding emerges are fairly common in behavioral research. An
example that immediately comes to mind is the experimental case where
partial reinforcement and cue change are applied "simultaneously". In this
hypothetical case we have two degrees of stimulus change (Variable A), a
control condition of "no change" and one degree of fairly marked change.
Variable B is the reinforcement schedule and, in a typical experiment of
this nature with human or infra-human organisms, the two conditions would
most likely be 100% and 50% reinforcement. The hypothetical findings
presented are not too far off the mark of actual findings in real-life
experimental settings (viz Rickard, 1959). Of passing interest is the fact
that a journal editor once turned down a paper containing this type of
finding on the grounds that the interaction had to be significant. It's
obvious, however, that if the two major sources of variation have a
"whopper" impact on behavior as in this instance, there cannot be much
behavior left over for interaction. Possibly a formal academic course in
inspectional analysis is called for. In any event, treatment of these
data by any appropriate analysis supports what can be seen: the two
variables have a significant influence on behavior. If one wished to
analyze the data without recourse to the elaborate procedures, inspection
reveals non-overlapping distributions for both sub-groups along both
experimental dimensions. In all instances, three events exceeding three
others yields a probability of .05 by the Arrangement Technique.

[Chart. CASE B: THE OPERATION OF BOTH VARIABLES]

CASE C: The Operation of Interaction. There are some instances
where behavioral change pivots on the joint action of two dimensions of
variation. These are typically cases that have a behavioral impact when
applied alone to one of the values of the other experimental treatment,
but operate differentially when several values of the second variable are
included. Such a hypothetical case is presented in the accompanying chart.
The differential effects of Variable A on Variable B are immediately
obvious. Behavior under Condition B1 decreases as A increases while under
Condition B2 it increases with A. The numbers reflect this situation in
showing the marginal totals to be about the same by rows and columns, but
to differ appreciably on the diagonal sums where the interaction effect
operates. Behavior under Conditions B1 and B2 pulls apart and, as a matter
of fact, fails to overlap, but in opposite directions for the two values
A1 and A2. This is clear interaction.

Analysis of these data will show the interaction term to be highly
significant while, across the board, neither Variable A nor B has a
significant effect. There is no contradiction here. Taken alone A has a
clear effect on B1 and taken alone it also has a clear effect on B2. But
taken together the effects of A on both B1 and B2 cancel out. It seems
clear that averaging the curves in the figure at the two sets of points
will yield no change and zero slope. Given one value of A, behavior under
B will differ, depending on whether the condition is B1 or B2. Behavior
thus depends on both variables; to specify it one must know the values of
both A and B. Behavior covaries jointly with the action of both dimensions
of variation.

An actual experimental example may help stamp in the point.
Findings such as those in Case C emerge in studies of rats' behavior in
open fields. With drive (food deprivation) as a primary variable,
behaviors classified as Gross Movements (GM consisting of locomotion and
rearing responses) and Fine Movements (FM involving washing, grooming,
scratching, sniffing and the like) operate quite differently. As drive
increases, GM increase and FM decrease. In other words, the rats under
high drive spend a good deal of their time running around and rearing up
on their hind legs. With appreciably lower drive levels, FM increase
markedly in frequency with the rats sitting around grooming rather than
locomoting or rearing. Thus in the accompanying representations, Variable
A constitutes drive, while B2 consists of Gross Movements, and B1 of Fine
Movements. The two sets of behaviors operate in a diametrically opposed
direction as a function of the drive variable. As drive increases, one set
of behavior increases while the other decreases. As drive decreases, the
converse case holds.

CASE D: All Three Sources of Variation Operating. Finally there
is the situation where both basic variables influence behavior along with
their joint action. The accompanying tabular and graphical representations
depict this state of affairs. It can be seen that B1 and B2 pull apart as
one goes from A1 to A2. This is clearly the most striking feature of the
representations. This differential reflects the decrement in behavior in
Condition B2 as contrasted to the lack of change in B1 proceeding from A1
to A2. The situation is shown in the marginal totals where a clearcut
differential emerges for both Variables A and B as well as for the
diagonals. Analysis of these numbers by the traditional procedures yields
significance for all three sources of variation.

A case somewhat akin to this has already been cited in the repeated
measurement section where the example was given of distribution of
extinction generating retarded decremental effects for the massed group
and/or facilitated decremental effects for the distributed extinction
condition. It will be recalled that classical statistics did not uncover a
significant interaction term that was visible in the data, but that a
subcomparison of the latter part of extinction strongly supported a
differential pulling apart of the massed and distributed curves and,
thereby, an appreciable interaction effect.

Another case in point is an experiment by Rowe (1955) in which a
test of the Skaggs-Robinson hypothesis was conducted with the rearing
response of rats in an open field. Hand removal reinforcement for rearing
was employed and an extinction test run with total number of rears counted.
Two basic conditions were involved: Generalization Decrement and
Skaggs-Robinson. In both groups the rearing response was built in by
removing the rats by hand when they were in a full rear in a particular
open field (A). Two additional open fields were constructed differing in
shape, height, illumination, texture and color of walls and floor, and the
like, one (C) quite different from Field A and the other (Field B) judged
to be midway between A and C. In one experiment dealing with
Generalization Decrement (GD), the rats were conditioned in Field A and
one-third tested in each of A, B and C. In the Skaggs-Robinson condition
all rats were similarly trained in Field A and, in addition, one-third
were given additional hand reinforced training in each of Fields A, B and
C. All the latter rats were then returned to Field A for their
free-responding extinction test.

Theory and previous data clearly suggest straight decremental effects
for the GD rats as dissimilarity increases from Field A through B to C.
The Skaggs-Robinson hypothesis suggests that, as dissimilarity of
interpolated learning increases, interference effects increase at first
and then decrease as dissimilarity becomes maximal. In other words,
additional training on the same material contributes to over-learning.
Learning of somewhat different materials interferes with retention of the
originally learned responses; and when dissimilarity is at the limit,
little or no interference with original retention occurs since the two
sets of stimuli and responses have little to do with one another. Thus the
prediction from this position is for a U-shaped function with retention
maximal (and interference minimal) at the extremes of the dissimilarity
continuum and retention interfered with the most in the middle range of
dissimilarity. Further, the U-shaped function should be asymmetrical with
continued practice on the originally learned materials producing maximum
retention above the level attained with interpolated practice on quite
dissimilar materials at the other extreme.

The results, given below, support both the GD and Skaggs-Robinson
hypotheses. The numbers represent rearing responses of the median rat in
each group in a 10 minute extinction test period.

GENERALIZATION DECREMENT

AA     AB     AC

69     64     58

SKAGGS-ROBINSON

AAA    ABA    ACA

77     50     67

Both the GD and Skaggs-Robinson functions emerge clearly in these
data. The GD finding declines in an orderly fashion while an asymmetrical
parabola appears in the Skaggs-Robinson data. Replication supported these
findings (Rowe, 1955).

Inspection indicates essentially no difference between the two
conditions where AA and AAA are compared. At the other two points (AB vs
ABA and AC vs ACA) significant differences are suggested. It is to be
noted that they are in opposite directions with the GD higher at the
midpoint and the Skaggs-Robinson higher in the extreme change condition.
In passing it might be noted that there was practically no overlap between
the AA and AC performance in the GD instance and between AAA and ACA on
the one hand and ACA on the other in the Skaggs-Robinson case.

A fair amount of variability characterized these data and the
overall factorial analysis merely suggests significance for the two
primary sources of variation and their interaction. In the final
replicated findings a higher level of significance was achieved. At the
least the findings are highly suggestive and indicate what behavioral
changes can be achieved with small Ns and variables with large
experimental effects. They further underscore the need for inspectional
analysis and non-necessity of elaborate statistical analysis.

Experimental Examples of Factorial Design. A couple of actual
examples from the laboratory may help tie down these several complicated
points concerning interaction. Table 19 contains some data from the
performance of rats in an open field. Half the rats were given one-trial
conditioning with hand-removal reinforcement in a field with cues
minimized (small number of cues, S) and the other half the same treatment
in a field with a large number of cues (L). Half of each group was then
given a one-trial test in the same field and the other half in the
different field. Latency of the rearing response was the index of behavior
and the entries in Table 19 represent difference scores between latency of
the training and testing rears. This is admittedly a pretty small chunk of
each S's behavior, but it will do for the case at hand.

[Table 19. Difference scores in sec. between one training and one test
trial for hand removal reinforcement of the rearing responses in four
groups of rats exposed to a small (S) number of cues or large (L).]

Immediately noticeable in Table 19 is that most Ss showed a decrease
in latency from training to test indicating that one hand-removal
reinforcement shapes behavior. The other immediately apparent item is that
the variable having whopper effects was cue change from training to test.
In other words, the large reductions in latency came in the groups where
stimulus conditions were not changed while the small gains in latency
accrued to the change from a small number of cues to a large number or
vice versa. It is also obvious that cues at the time of test per se had
very little effect on behavior and cues in training only slightly more.
The "real" effect is clearly the impact of change in cues or lack of it
from training to test. In line with a huge number of generalization and
generalization decrement studies, this experiment shows behavioral
decrement associated with stimulus change.

The simplest analysis is, of course, a direct comparison of the
behavior of all Ss treated with cue change with the responses of those
having constant stimulation. There is overlap by only one case in the two
distributions. So by any statistical token the two sets of behaviors are
from different parent populations. For the present purpose, the factorial
analysis needs doing. The results of this procedure are completely
consistent with inspection in revealing very high significance for the
interaction (change) between the training and test treatments. (Note Case
C previously discussed.) Neither cues on training nor test show anything
to speak of. As mentioned previously, this procedure seems like a lot of
work for the returns involved, but is probably worth it for didactic
purposes.

Table 20 contains some data from a quite different experimental
setting where the influence of partial reinforcement and success or
failure in anagram solution was tried out. College students were first
conditioned to emit a certain class of words with verbal reinforcement.
Half were conditioned on a 100% and half on a 50% reinforcement schedule.
The reinforcement groups were then further subdivided for anagram
problems. Half of each subgroup was given insoluble anagrams with one
letter changed so that a word could not be constructed (Failure). The
other half had the solvable anagrams (Success). The final phase of the
experiment consisted of conducting extinction for the originally
conditioned word class. Thus the investigation was designed to test the
effects of reinforcement schedule, success or failure and the joint action
of the two. Table 20 contains a selected portion of the data chosen to
represent the major trends of the original numbers.

A quick look at this table indicates a clear trend in the data. It
is quite apparent that the experience of success or failure with the
anagrams had little impact on behavior. It is also obvious that the joint
action of the two variables had very little effect. The large effect is
associated with reinforcement schedule. Only one case in the 100% group
gets onto the 50% distribution, indicating a far greater resistance to
extinction after partial than after 100% reinforcement. Either way the
data are sliced - the simple Arrangement Technique or the complex factorial
analysis - the outcome is significant for the reinforcement variable alone.



Table 20

Number of words emitted in extinction after 50% and 100%
reinforcement in conditioning and after success or failure in
solving anagrams.

REINFORCEMENT          ANAGRAM SOLUTION
SCHEDULE             Success      Failure

                        13           15
   50%                  16           12
                        15       [illegible]

                         9            8
  100%                   5           10
                        12            4



5. Factorial Design: More Than Two Treatment Groups. Thus far
we have considered only "simple" factorial designs where each of the two
variables is broken down into only two conditions. From a design
standpoint, of course, each dimension of variation may encompass three or
more conditions. For example, in studying generalization of the size
concept in children, one might train three groups, one on animals, one on
"pure" shapes and one on toy vehicles. One third of each group would then
be tested for generalization and generalization decrement on each of the
three types of objects. The focus would be on the extent to which size
discrimination with one set of objects generalized or transferred to the
other objects. Or one might be interested in studying acquisition rate in
children as a joint function of socio-economic status and age with several
degrees of each variable represented.

In an experimental example, Rickard (1959) studied the extinction
behavior of college students as a joint function of reinforcement
schedule in conditioning and degree of cue change in extinction. The
results are contained in Table 21 where it is apparent that both dimensions
of variation had a large and consistent impact on behavior. It seems
obvious that the big decremental effect from no cue change (UC) to the
other extreme (EC) and from infrequent partial reinforcement to more
frequent is so great that little interaction of the two dimensions could
emerge. On more thorough analysis this turns out to be precisely the case.
A large chunk of the variance is taken out by cue change and another large
amount by reinforcement schedule leaving very little behavioral variation
for interaction. Again inspection pays off. One might conclude, even with
the relatively large number of numbers represented in Table 21, that
inspectional analysis could tell the whole story without any actual
manipulation of the numbers. Such a procedure would save a great deal of
time, effort and frustration on the part of the analyst.



Table 21

Extinction responses in college students as a joint function of
reinforcement schedule in conditioning and degree of cue change
in extinction.

(Rickard, 1959)
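The idea that two large main effects leave little variation for the
interaction can be illustrated with a rough partition of sums of squares.
The 3 X 3 table of cell means below is hypothetical (it does not reproduce
Rickard's numbers), and the partition is carried out on cell means only, so
no within-cell error term appears. This is a sketch under those assumptions.

```python
# Hypothetical cell means, NOT the entries of Table 21.
# Rows: reinforcement schedule (infrequent -> frequent), assumed ordering.
# Columns: degree of cue change (none -> extreme), assumed ordering.
table = [
    [30.0, 22.0, 13.0],
    [24.0, 15.0, 8.0],
    [17.0, 10.0, 2.0],
]

n_rows, n_cols = len(table), len(table[0])
grand = sum(sum(row) for row in table) / (n_rows * n_cols)
row_means = [sum(row) / n_cols for row in table]
col_means = [sum(table[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]

# Sums of squares among the cell means.
ss_total = sum((x - grand) ** 2 for row in table for x in row)
ss_rows = n_cols * sum((rm - grand) ** 2 for rm in row_means)
ss_cols = n_rows * sum((cm - grand) ** 2 for cm in col_means)
ss_interaction = ss_total - ss_rows - ss_cols

for label, ss in [("schedule", ss_rows), ("cue change", ss_cols),
                  ("interaction", ss_interaction)]:
    print(f"{label:12s} {ss:7.1f}  {100 * ss / ss_total:5.1f}% of total")
```

With numbers of this kind nearly all of the variation among cell means is
carried by the two main dimensions, which is what inspection of the row and
column trends already suggests.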

To illustrate this type of design involving two dimensions of
variation with three or more groups along one or both, an experiment
dealing with the Freudian concept of "displacement" may be cited. Miller
(1948) has presented an impressive translation of the concept of
displacement into stimulus-response terms. His argument goes that the
approach tendency exhibits greater generalization than the avoidance in an
approach-avoidance conflict setting. From this position he deduced that
experimentally induced conflict would be followed by an increment in
response strength when the stimulus situation was changed. Miller and
Kraeling (1952) found, in line with this expectation, that comparable
groups of rats ran more frequently in changed than unchanged alleys after
conflict training. A series of studies with pigeons in Skinner boxes failed
to yield this incremental effect (Brush et al., 1952). More recently,
Murray and Berkun (1955) attempted an integration of Miller's conflict
(1951) and displacement models, and tested their deductions using training
procedures very similar to those of Miller and Kraeling.

A re-examination of the Miller-Kraeling procedure seemed
appropriate. Their rats were given approach training first, followed by
avoidance conditioning. Thus, the terminal response learned was not-to-run.
From the point of view of a contiguity theory, we can expect cue change
to weaken the last response conditioned. If we consider the two mutually
exclusive response classes of running and not-running, weakening of the
latter will produce an increment in the former. If this is the case, the
Miller-Kraeling results can be accounted for by focusing on the effects
of generalization decrement in the last response conditioned.

It follows from this line of argument that conflict is not as
essential to the "displacement" phenomenon as it would seem to be in
Miller's position. The effect should emerge under conditions of pure
avoidance conditioning without the approach aspect. Lord and Taylor (1958)
ran such a study in which groups of rats were trained in runways under
conditions of 1) approach, 2) avoidance, and 3) combined approach-avoidance
(conflict). Two runways were employed differing in height, width, interior
brightness, dividing lines and texture of the floor and ceiling
characteristics. The approach group is, of course, the traditional
generalization and generalization decrement condition. Half of the straight
approach group was trained to run for food in one alley; the other half in
the other. Half of each of the sub-groups was tested in the same (training)
alley and the other half switched for test. No alley differences emerged.

In the avoidance condition the rats were dropped into a padded bucket
in the terminal unit of the runway. In the approach-avoidance condition,
approach training was given first followed by the avoidance-drop
treatment, following the Miller-Kraeling procedure.

Before turning to the outcome, it needs to be repeated that the key
group is the straight avoidance group. If the contiguity reasoning is
correct that the terminal response is crucial, and if this group exhibits
increased running on test (changed-cue) trials, conflict is not an
essential ingredient for the generation of the phenomenon.

The results of the Lord and Taylor "displacement" experiment are
summarized, S-by-S, in Table 22 where focus on the "pure" avoidance group
indicates a marked difference in behavior between the changed and
unchanged conditions. In essence, the rats exposed to change ran, while
those remaining under their training stimulation did not. The contiguity
argument proved out in the data and the case for conflict does not hold
water. In passing, it is obvious that the straight generalization
decrement approach group behaved exactly in accord with expectation.

The behavior of the conflict group is a focal point. In their case, 
running was first conditioned and then replaced by avoidant, non-running 
behavior. The expectation is, that under changed conditions, the last re- 
sponse trained (avoidance) should be weakened and replaced by the only 
other response (previously) conditioned, namely, running. In other words, 
under changed cue conditions the conflict rats should revert from avoid- 
ance to running. They did. The difference between behavior under the 
conflict-changed and avoidance-changed conditions is not great, but in 
the expected direction. Again the contiguity position is supported. 

The argument might arise that the avoidance condition is actually
an approach-avoidance conflict because the rat brings an "exploratory"
approach tendency to the experimental setting that is pitted against the
avoidance conditioning. If this situation prevails, Miller's position
and ours reduce to the same thing. Our contiguity viewpoint still has the
advantage, however, of reference to directly observable responses and
stimulus changes rather than hypothetical entities. In any event, this is
a tenuous argument. After all, any results can be explained in an ad hoc
fashion post facto, but this procedure is not likely to advance behavioral
science.



Table 22

Median running time in sec. to reach the end box in three two-min.
generalization test trials after approach, approach-avoidance
(conflict) and avoidance conditioning.

(Lord and Taylor, 1958)

              APPROACH              CONFLICT              AVOIDANCE
         UNCHANGED  CHANGED    UNCHANGED  CHANGED    UNCHANGED  CHANGED

             5         4         360+      360+        322       118
             2        50         264       290         360+      110
             1        19         360+       67         360+      360+
             2        32         360+       53         360+      135
             5         8         360+       25
                      12

Median      2.0      15.5        360+       67         360+     126.5

P(t)            .024                 .016                  .016

Turning to the numbers of Table 22, this is a case where the inter- 
action of the two dimensions of variation is of no great consequence. We 
are interested directly in decremental effects in running in the "pure" 
approach condition and in decrements in not-running (avoidance) in the 
conflict and "pure" avoidance conditions. By this token, the pertinent, 
direct comparisons are made within and across groups. The main findings 
are clear without statistical manipulation: running decreases in the ap- 
proach group and increases in the other two groups. The results support 
the contiguity position and, minimally, raise serious doubts about the 
need for conflict in the occurrence of the displacement phenomenon. In
turn, of course, the theory behind the conflict position is called into 
grave question. 
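The direct comparisons described for Table 22 amount to little more than
lining up the medians. The sketch below takes the individual running times
from Table 22 as I read them; scoring the "360+" entries at the 360-sec.
ceiling is an assumption made only so that medians can be computed.

```python
from statistics import median

# Running times (sec.) from Table 22. "360+" entries are scored at the
# 360-sec. ceiling, an assumption made for computation only.
times = {
    ("approach", "unchanged"): [5, 2, 1, 2, 5],
    ("approach", "changed"): [4, 50, 19, 32, 8, 12],
    ("conflict", "unchanged"): [360, 264, 360, 360, 360],
    ("conflict", "changed"): [360, 290, 67, 53, 25],
    ("avoidance", "unchanged"): [322, 360, 360, 360],
    ("avoidance", "changed"): [118, 110, 360, 135],
}

medians = {key: median(scores) for key, scores in times.items()}

# A positive shift means slower running under changed cues (less running);
# a negative shift means faster running (more running).
shift = {
    group: medians[(group, "changed")] - medians[(group, "unchanged")]
    for group in ("approach", "conflict", "avoidance")
}

for group, value in shift.items():
    direction = "less running" if value > 0 else "more running"
    print(f"{group:10s} median shift {value:+7.1f} sec. ({direction})")
```

The signs alone carry the argument: the approach group slows down under
cue change while the conflict and avoidance groups speed up.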

Given the theoretical issues involved and the data of Table 22, it 
is left to the reader to decide whether traditional, interactional sta- 
tistics should be applied to the data. 

Another illustration of "simple" factorial design may be found in
Newton's (1953) investigation of the effects of reward and punishment on
learning and tachistoscopic recognition. He somewhat unintentionally
conducted a test of the Skaggs-Robinson Hypothesis previously treated. He
had three groups of 20 college students each learn a list of five-letter
meaningful words. One group was presented verbal and monetary rewards for
correct responses; a second was verbally chastised and lost money for
incorrect responses; and the third group constituted a baseline case
receiving neither reward nor punishment. Next he ran a tachistoscopic test
in which he briefly flashed the original words singly along with words
having one, two, three or four letters changed. For example, if the
original word happened to be BASIN, it was presented along with BASIS,
BARON, and BIRCH. On a generalization basis, it can be expected that the
more similar the word to the originally learned one, the greater the
probability of the original response. Thus with one or two letters changed,
intrusive errors of this kind should be maximal. With maximal
dissimilarity, on the other hand, discrimination should operate to
appreciably reduce interference effects. There is practically no
incompatibility between BASIN and ALONE presented tachistoscopically. The
original word should, of course, benefit from previous practice in the
acquisition setting. From these premises, it follows (post facto) that a
Skaggs-Robinson function should emerge as dissimilarity of stimulus
materials increases. The following insert shows the mean number of errors
in the tachistoscopic test as a joint function of learning condition and
number of letters changed:

LEARNING            NUMBER OF LETTERS CHANGED
CONDITION
                     0        1        2       3-4

Reward              2.2      4.9      4.3      3.3
Ignore              1.7      4.2      4.2      3.4
Punish              4.0      8.8      8.3      5.4


The Skaggs-Robinson Hypothesis is supported by the emergence of the
predicted, asymmetrical U-shaped function. Errors are minimal for the
originally learned words, next for the most dissimilar words and most
frequent for the intermediate degrees of change. Noteworthy, but
incidental, is the fact that the tachistoscopic presentation was sensitive
to the punishment treatment where the original learning situation had
not been.
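The ordering claimed for the insert can be checked row by row. The sketch
below uses the mean error scores from the insert; the helper name
follows_predicted_order is my own and simply encodes the stated ordering
(fewest errors for the original words, next for the most dissimilar words,
most for the intermediate degrees of change).

```python
# Mean errors from the insert, by learning condition.
# Columns: 0, 1, 2 and 3-4 letters changed.
errors = {
    "Reward": [2.2, 4.9, 4.3, 3.3],
    "Ignore": [1.7, 4.2, 4.2, 3.4],
    "Punish": [4.0, 8.8, 8.3, 5.4],
}

def follows_predicted_order(row):
    """True if original < most dissimilar < both intermediate degrees."""
    original, one_changed, two_changed, most_changed = row
    return original < most_changed < min(one_changed, two_changed)

pattern = {cond: follows_predicted_order(row) for cond, row in errors.items()}

# Column means across the three learning conditions.
column_means = [sum(col) / len(col) for col in zip(*errors.values())]

print(pattern)
print([round(m, 2) for m in column_means])
```

All three learning conditions show the same ordering, so the function holds
for each row separately and not merely for the pooled means.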

By token of the several previous presentations, it should be
obvious that, in this instance, all three dimensions of variation turned
out to be significant: learning condition of reward-punish-ignore,
number of letters changed in tachistoscopic recognition and the interaction
of the two main treatments.

MAGNITUDE: MORE THAN TWO "SIMULTANEOUS"
DIMENSIONS OF EXPERIMENTAL VARIATION

Sometimes investigators, possibly unfortunately, decide to throw
a number of experimental vegetables into the design stew at the same time.
There are alternate strategies. One is to piece out the research with a
number of sub-experiments with their cross-comparisons and their economic
shortcuts of pivoting several experimental groups on one control.
This is clearly the present position. The alternative is to throw all
variables into the pot at the same time and see what emerges. A case
supporting the latter view can clearly be made, but there are a couple
of objections. One is statistical, namely, that anova procedures are
quite (maybe overly) sensitive to outlying cases and other data aberrations
so that the analysis can be thrown off and distorted, leading to
inappropriate conclusions. The other objection - in addition to
complications of design and statistics - is more behavioral, namely, that
with many variables applied together it is frequently difficult to "see"
what has happened. The human eye obviously has its limitations and it is
very difficult if not impossible to determine what differential action has
occurred when, say, four or five variables and their several interactions
are operating.

With these comments made, let us consider the kinds of experimental
cases in which complex anova might be applied where three or more
dimensions of experimental variation are employed. In an educational
setting one might be interested in studying learning rate as it covaries
with age, sex, socio-economic status, sibling position, presence or absence
of parent(s), and characteristics of the examiner. This project could be
accomplished by a series of experiments based on comparable samples of Ss
involving one or two variables at a time and cross comparisons along the
relevant dimensions, or all variables could be applied together. The latter
involves considerable pre-experimental planning. With the six variables
cited there are a minimum of 15 sub-groups to consider. The problem of
procurement is real and becomes more pressing when it becomes apparent
that variability in an experiment such as this is likely to be great and
fairly large Ns are needed.
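The size of the planning problem can be made concrete with a small count.
The sketch below assumes, purely for illustration, that each of the six
variables is cut into only two levels and that five Ss fill each cell;
neither figure is specified for this example. Under those assumptions the
figure of 15 matches the number of distinct pairs of variables, while a
complete crossing of all six would call for considerably more cells.

```python
from itertools import combinations, product

variables = [
    "age",
    "sex",
    "socio-economic status",
    "sibling position",
    "presence of parent(s)",
    "examiner",
]

# Distinct pairs of variables (one two-variable sub-experiment per pair).
pairs = list(combinations(variables, 2))

# Complete crossing, ASSUMING two levels per variable.
levels_per_variable = 2
cells = list(product(range(levels_per_variable), repeat=len(variables)))

# ASSUMED sub-N of five Ss per cell.
subjects_per_cell = 5
total_subjects = len(cells) * subjects_per_cell

print("pairs of variables:", len(pairs))
print("cells in a complete crossing:", len(cells))
print("Ss needed at five per cell:", total_subjects)
```

Even at the minimum of two levels per variable the procurement problem
grows quickly, which is the practical argument for the piecemeal strategy.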

Another case in point consists of some data that came to hand
recently involving the covariation of grade, sex, birth order and their
interactions against IQ, achievement and scores in reading, language,
arithmetic and total score along with Grade Point Average. Out of this
veritable hodge-podge (some measures were missing for some Ss) there emerge
some 42 F-values from anova along with various and sundry other numbers.
Some of the values are significant, some insignificant. Replication yielded
comparable findings. In fairness to the investigator, it should be noted
that the variables and relationships he focused on held up across the two
studies. The simpler way would clearly have been to select out a couple
of variables and a couple of measures and run the partial (but main)
incomplete block design utilizing only a small portion of the variables.

It would seem that, unless the investigator really doesn't "know" 
the effects of his several variables and is therefore simply indulging 
in experimental fishing, it is wise to put the effort into judicious
selection of variables and measures and cut the design down to workable
size.

To illustrate complex anova, let us consider a hypothetical case. 

Sex and The Squirm Test. Suppose one were interested in the
influence of sex content in a motion picture on behavior. Obvious variables
to build into the design are the sex and age of the audience. One way
to conduct the experiment is with large Ns; another with small. We will
take the latter. Individuals or groups might be matched on the basis of
previous exposure to movies with and without sex content and the like, but
for simplicity purposes we will take independent groups with sub-Ns of
five. There emerges a 2 X 2 X 2 factorial design with two conditions of
each of three dimensions of variation: 1) movie with and without sex
content (one would take appropriate control action with regard to duration,
other content, seating arrangements, etc.), 2) male and female participants
and 3) older and younger Ss. Thus there would be eight groups of five Ss
each.

A major consideration, as always, is the index of behavior to be
employed. Attitudes toward the movie are one facet of behavior, but a
more direct index is called for. The measure proposed is the Squirming
Response consisting of the amount of movement exhibited by the audience
in their seats. It would be relatively easy to doctor the individual
seats with recording devices that would be sensitive to and pick up
slight body movements. If one wished to get fancy about recording,
photographic records could be taken of the facial expressions and movements
of each individual S, but the short-coming of this procedure, as always in
this instance, is the enormous effort that has to be expended in analyzing
the data. Records of squirming would include body contacts with
neighbors, particularly of the opposite sex and these possibly should be
partialled out for separate treatment, but a line has to be drawn somewhere.

Hypothetical data are presented in Table 23 where careful inspection
shows more activity for the movie with sex content and in that framework
more movement for males and younger people. There is thus a clear
suggestion of an interaction effect between the movie variable and audience
sex and age.

Sub-comparisons by the several techniques previously spelled out are
clearly appropriate, but for purposes of exposition the writer waded
through the classical complex anova procedure for treating these data.
(Incidentally, it took a long half hour to complete the analysis plus some
45 minutes of a graduate assistant's time in replicating the analysis. All
sub-comparisons by more efficient shortcut techniques took a total of less
than 15 minutes; looking at column and row sums for inferences took less
than five minutes.) From the overall analysis for main effects only the
movie emerged significant. The major variables of sex and age of audience
did not hold up. This finding makes sense in terms of the raw data of
Table 23. Here it is clear that the totals for sex alone and for age
alone are approximately equal indicating practically no effect of these
variables.



Table 23

SEX AND THE SQUIRMING RESPONSE

Hypothetical squirming responses to a movie with and without a
heavy sex element for audiences split by age and sex.

MOVIE WITH SEX

(only part of this half of the table is legible and the column
assignment of the entries is not recoverable)

   9   7   8   4   8
   8   4   7   [illegible]   7   3

MOVIE WITHOUT SEX

       Male                Female
   Young    Old        Young    Old

     3       6           4       7
     2       5      [illegible]  6
     2       5           3       6

On the interaction side of the fence it was, however, a different 
story. In line with inspection the first order interactions of movie 
with audience sex and movie with audience age both turned up to be highly 
significant. The age-movie interaction had considerably more impact and 
accounted for much more of the variance than did the movie-sex interaction 
in conformity with expectation from inspectional analysis. The second 
order interaction of all three variables was quite insignificant. 
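The pattern just described, main effects of audience sex and age that
vanish while both interact with the movie variable, is easy to see in a set
of cell means. The eight means below are hypothetical values of my own
choosing, not the entries of Table 23, picked only so that the contrasts
come out cleanly; the contrast function is likewise an illustrative helper.

```python
# Hypothetical cell means, NOT the entries of Table 23.
# Keys are (movie, audience sex, audience age).
cell_means = {
    ("with", "male", "young"): 9,
    ("with", "male", "old"): 6,
    ("with", "female", "young"): 7,
    ("with", "female", "old"): 4,
    ("without", "male", "young"): 2,
    ("without", "male", "old"): 5,
    ("without", "female", "young"): 4,
    ("without", "female", "old"): 7,
}

# Each factor level is coded +1 or -1 for forming contrasts.
SIGN = {"with": 1, "without": -1, "male": 1, "female": -1, "young": 1, "old": -1}

def contrast(factors):
    """Signed contrast over the listed factor positions.

    Positions: 0 = movie, 1 = audience sex, 2 = audience age.
    A single position gives a main effect; two or three give interactions.
    """
    total = 0
    for cell, value in cell_means.items():
        sign = 1
        for position in factors:
            sign *= SIGN[cell[position]]
        total += sign * value
    return total / 4

effects = {
    "movie": contrast((0,)),
    "sex": contrast((1,)),
    "age": contrast((2,)),
    "movie x sex": contrast((0, 1)),
    "movie x age": contrast((0, 2)),
    "sex x age": contrast((1, 2)),
    "movie x sex x age": contrast((0, 1, 2)),
}

for name, value in effects.items():
    print(f"{name:18s} {value:+.1f}")
```

Here the totals for sex alone and for age alone are exactly equal, yet
males and younger Ss squirm more with the sex movie and less without it, so
the two first order interactions with the movie variable are sizable.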

This experimental example of sex and the squirm test is a relatively
simple one and the data have been doctored to yield clearcut effects.
Such is not the case with many other instances where single deviant cases
or peculiarities of interaction turn up. This point will now be
illustrated.

It is quite an easy matter to criticise published articles - even
one's own - on methodological and statistical grounds. As a matter of
fact it is relatively easy (and dramatic for undergraduates) to open their
textbook to any page and find something wrong. The words are easy; the
numbers are hard. For many years I have given students in graduate
research and methodology seminars articles in Japanese with the purpose of
reconstructing the experiment from the tables of data. A moderately
sophisticated observer of the behavioral scene can do this with little
trouble. Another tour de force I have conducted is to have students open
any issue of any journal to any page containing tabular material and work
it over. In roughly two-thirds of the cases we have been able to find
something "wrong", mostly minor, but sometimes major.

I found a copy of the Journal of Experimental Child Psychology for 
October, 1966. It opened most easily to page 253 which revealed a table, 
Table 2, entitled "Mean number of responses made during each minute of 
reinforcement period". I examined the data without referring to other 
parts of the article and drew my conclusions without manipulating the 
averages. (No indices of variability were presented). I then looked at 
the authors' (Stevenson and Odom) conclusions. They did not agree with 
my interpretations so I went back and looked at the "problem" and "method" 
sections of the paper. Some quite interesting angles emerged. It turned 
out to be a rather complex experiment entitled "Visual reinforcement with 
children". In it the investigators tried out the effects of Examiner sex 
differences along with the sex and age of Ss who were 192 boys and girls 
ages 6-7 or 10-11. The task to be learned was a lever-pressing one with 
pictures, colors and line drawings presented as reinforcement on the aver- 
age of once every 20 responses. There were thus two conditions for each 
of three dimensions of variation and three conditions for the reinforce- 
ment dimension. 

Initially, in examining these data my focus was on the treatment
variables, but it shifted after consideration of their data concerning
operant level performance ("base rates"). Table 24 contains the mean base
rates and, parenthetically, the comparable figures for the period of
reinforcement.



Table 24

"Mean base rates" (operant level) of lever-pressing in 192
children as a function of age of S and sex of E and S. The figures
in parentheses are corresponding means for the reinforcement
period when pictures were presented.

(Stevenson and Odom, 1966)

                           SEX OF E
SEX OF S              Male            Female

Male
   6-7 years       72.0 (72.3)     59.2 (69.1)
   10-11 years     78.2 (82.9)     56.3 (70.4)

Female
   6-7 years       63.0 (60.1)     55.2 (64.2)
   10-11 years     71.9 (73.3)     46.2 (50.0)

The eye-catching feature of the numbers in Table 24 is their
similarity. In other words, if learning took place, the indications for it
are limited. The average gain in means from operant level determinations
to conditioning was ca. five responses, a matter of a shade over 7%. One
cannot fail to be impressed with the small order of magnitude of these
numbers. By no stretch of the imagination do they indicate an appreciable
amount of acquisition. To complete the picture it would be nice to have
at hand the count figure of the number of Ss showing increments in
responding during reinforcement. Another special feature of these findings
is that in extinction a number of the groups showed an increase in reaction
contrary to the usual decremental effects.

The other noteworthy item in Table 24 is the Examiner variable. Far
and away the largest differences contained in this representation are
associated with this source of variation. This point holds across age and
sex levels of Ss and across from operant level through conditioning into
extinction. Response level was higher with the male E than with the
female. The investigators' elaborate analyses of the data support what the
naked eye can see clearly. The authors report no other significant
differences in the data although inspection shows a consistent trend for
male groups to respond at a higher level than female. All eight of the
differences for comparable means in Table 24 are in this same direction.
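What the naked eye does with Table 24 can be written out as a set of
matched differences. The sketch below is restricted to the mean base rates
as I read them from the table (the parenthesized reinforcement means are
left out), and the helper paired_differences is my own illustrative
function. It takes one factor at a time and differences each mean against
its partner with the other two factors held fixed.

```python
# Mean base rates from Table 24, keyed by (sex of S, age of S, sex of E).
base = {
    ("male", "6-7", "male E"): 72.0,
    ("male", "6-7", "female E"): 59.2,
    ("male", "10-11", "male E"): 78.2,
    ("male", "10-11", "female E"): 56.3,
    ("female", "6-7", "male E"): 63.0,
    ("female", "6-7", "female E"): 55.2,
    ("female", "10-11", "male E"): 71.9,
    ("female", "10-11", "female E"): 46.2,
}

def paired_differences(position, first, second):
    """Differences (first - second) on one factor, other two held fixed."""
    diffs = []
    for key, value in base.items():
        if key[position] == first:
            partner = key[:position] + (second,) + key[position + 1:]
            diffs.append(round(value - base[partner], 1))
    return diffs

examiner = paired_differences(2, "male E", "female E")
subject_sex = paired_differences(0, "male", "female")
age = paired_differences(1, "10-11", "6-7")

for name, diffs in [("Examiner", examiner), ("Sex of S", subject_sex), ("Age", age)]:
    consistent = all(d > 0 for d in diffs) or all(d < 0 for d in diffs)
    average = sum(diffs) / len(diffs)
    print(f"{name:9s} {diffs}  mean {average:+.2f}  consistent: {consistent}")
```

On the base rates alone the Examiner differences are both the largest and
fully consistent in direction, the sex-of-S differences are smaller but
equally consistent, and the age differences change sign from one comparison
to the next.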

The point clearly to be made in connection with this paper is not
that the investigation was "wrong" or valueless or anything of the sort.
It is clearly a worthwhile piece of research. It seems apparent that the
data were not exhaustively examined and the small differences between
unconditioned and conditioned responding were not taken into account. This
matter does not affect the conclusion concerning Examiner effects
directly. Indirectly, it makes one wonder about the potential generality of
the Examiner differences. It is important to note the small amount of
learning accruing to these procedures. It raises a number of solid
parametric questions concerning the basis for minimal acquisition.

The data of this study point up and capitalize the basic need for
careful study of numbers reflecting behavioral changes before elegant
statistics are applied. A major facet of the analysis of this study
should have focused on the pre-conditioning-conditioning differences.
Inspection would have pointed the way.

It is frequently a profitable exercise to lay out a research program
on a grand scale throwing all the variables and potential variables into
the pot - on paper. At this point it is also a worthwhile intellectual
and didactic exercise to spell out the large scale design that would be
translated into experimental practice given adequate funds, time and
personnel. Having laid out this overview, it is extremely wise to then
study it carefully to insure that the variables included are worth the
experimental trouble, that others haven't investigated them thoroughly, and
so forth. What frequently happens is that half or more of the grand design
can be sloughed off ab initio. Further consideration of the remaining
matrix of variables sometimes suggests a priority listing such that some
sub-experiments emerge as more basic and appealing to the investigator than
others. Wading through this somewhat tortuous process may leave only 10%
of the original overall program, but it will be the heart of the matter
for the particular investigator at that particular point in his research
career.

Again, caveat emptor re large numbers. Large numbers of variables 
have many of the same disadvantages as large numbers of Ss plus the fact 
that the situation is sensitized to enhance the role of chance by their 
use. Quantity by no means insures quality. As a matter of fact it may
well militate against it.

MAGNITUDE: ANALYSIS OF COVARIANCE (ANCOVA):
PARTIALLING OUT THE EFFECTS OF ONE VARIABLE UPON ANOTHER

There are many instances in behavioral research where a variable
influences behavior that is not part and parcel of the experimental
treatment. The investigator is sometimes "aware" of the action of such
variables and in such instances, of course, makes his behavioral
measurements. There are basically two cases of this kind. The first is
where the investigator has introduced a pre-test or selection or matching
variable and for unknown reasons (presumably "chance"), the situation goes
awry and the groups do not come out equivalent on the initial measure. When
this lack of equivalence is large, and particularly when it stacks the deck
in favor of the experimenter's hypothesis and expected direction of his
treatment, some corrective procedure has to be applied. Since these events
are post facto, i.e., occur after the initial measurements, the corrective
is statistical. Verbally the technique is straightforward. The initial and
final measures are subjected to independent statistical analysis. Then
the relationship (correlation) between the two is statistically handled in
such a way as to partial out differences in the initial measures from
those in the final behaviors. The arithmetic is a little complicated,
but this is the essence of the procedure. For example, a substantial
difference in terminal behavior may be washed out when an appreciable
initial difference in the same direction is taken into account.
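The arithmetic can be sketched for the simplest case of two groups. The
scores below are hypothetical, invented for illustration only, and the
procedure shown is the usual textbook adjustment (each group's final mean
is corrected by the pooled within-group slope times that group's departure
from the grand initial mean), which I am assuming is the one intended here.
Significance testing is left out.

```python
from statistics import mean

# Hypothetical (initial, final) scores for each S; NOT data from this paper.
groups = {
    "experimental": [(12, 20), (14, 23), (16, 25), (18, 28)],
    "control": [(8, 15), (10, 17), (12, 20), (14, 22)],
}

def pooled_within_slope(groups):
    """Regression of final on initial scores, pooled within groups."""
    sum_xy = sum_xx = 0.0
    for scores in groups.values():
        mean_x = mean(x for x, _ in scores)
        mean_y = mean(y for _, y in scores)
        sum_xy += sum((x - mean_x) * (y - mean_y) for x, y in scores)
        sum_xx += sum((x - mean_x) ** 2 for x, _ in scores)
    return sum_xy / sum_xx

def adjusted_means(groups):
    """Final means with initial differences partialled out."""
    slope = pooled_within_slope(groups)
    grand_x = mean(x for scores in groups.values() for x, _ in scores)
    return {
        name: mean(y for _, y in scores)
        - slope * (mean(x for x, _ in scores) - grand_x)
        for name, scores in groups.items()
    }

raw_difference = (mean(y for _, y in groups["experimental"])
                  - mean(y for _, y in groups["control"]))
adjusted = adjusted_means(groups)
adjusted_difference = adjusted["experimental"] - adjusted["control"]

print("raw final difference:     ", raw_difference)
print("adjusted final difference:", adjusted_difference)
```

In this made-up case the experimental group starts four points ahead, and
once that head start is taken into account the terminal difference nearly
disappears.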

The second instance in which initial differences have to be
considered in assessing final ones involves the case where the nature of
the experimental treatment is such that it has an appreciable impact on
both initial and final measures. (A related case is that of a variable that
free-floats and is not under experimental control where E can simply
measure its behavioral consequences.) For instance, certain treatments,
e.g., partial reinforcement and distribution of practice, have notable
influence on both conditioning and extinction. The investigator may be
interested in the "pure" effect of such treatments on extinction, say,
uncontaminated by the influence of the procedure on conditioning. In this
instance, he may wish to remove the effects, statistically, of the
influence of the variable on the earlier measures from that on the later.

There is one special but basic problem that must be considered in
this context. It makes a great deal of difference in which direction the
experimental treatment influences the two phases of measurement. It may
increase response strength in both, decrease it in both or act
differentially to increase it in the one and decrease it in the other or
vice versa. A clear case in point is partial reinforcement. Here lowered
frequency of reinforcement in conditioning tends to retard learning and
generate a lower level of stabilized responding after conditioning is
complete as contrasted with a higher frequency of occurrence of the
reinforcing stimulus. In extinction, however, the situation is exactly
reversed and the group with the lower frequency of reinforcement yields
greater resistance to extinction than the comparison group with more
reinforcement in conditioning. If one is focusing on extinction there is no
problem. Differences favoring greater resistance to extinction in the
partially reinforced group emerge despite this group's lower level of
performance in conditioning. If one is a purist, one might wish to still
apply the analysis of covariance procedures, but by the nature of the
situation such application can do nothing but enhance the extinction
differences favoring the partially reinforced group. In the limiting case
where non-overlapping distributions appear, correction is clearly a waste
of time.
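Why the correction can only enhance the difference in this case follows
from the sign of the adjustment term. For two groups the adjusted final
difference is the raw final difference minus the pooled within-group slope
times the initial difference. The numbers below are hypothetical, and a
positive slope between conditioning and extinction responding is an
assumption of the sketch.

```python
def adjusted_difference(final_diff, initial_diff, slope):
    """Two-group covariance adjustment: final_diff - slope * initial_diff."""
    return final_diff - slope * initial_diff

# Hypothetical values; differences are partial group minus comparison group.
slope = 0.5        # ASSUMED positive within-group relation
final_diff = 10.0  # partial group emits more responses in extinction

# Initial difference in the SAME direction as the final one.
same_direction = adjusted_difference(final_diff, initial_diff=12.0, slope=slope)

# Initial difference in the OPPOSITE direction (the partial reinforcement
# case: lower responding in conditioning, higher in extinction).
opposite_direction = adjusted_difference(final_diff, initial_diff=-12.0, slope=slope)

print("same direction:    ", same_direction)
print("opposite direction:", opposite_direction)
```

When the initial difference runs against the final one, subtracting a
negative quantity adds to the final difference, so the adjustment cannot
reduce an effect that is already plain in the raw extinction scores.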

Some previously unpublished data illustrate this point. The reason- 
ing behind this experiment was that, on a generalization and generalization 
decrement basis, the more conditioning is made like extinction, the greater 
the resistance to extinction. The two essential ingredients of prolonged 
extinction are absence of the reinforcing stimulus and a low level of re- 
sponding. These were approximated in conditioning by teaching pigeons to 
wait between responses. (This could be described as an experiment in damping
out "impulsivity.") This training was not easy because of the ballistic
nature of the pecking response. In training, reinforcement was presented
for the E-group only after longer and longer pauses. This shaping
was continued until the E-birds were making about 30 responses per hour or 
one every two minutes on a partial reinforcement basis. 

The Control Group is a problem in an experiment such as this. To 
use the typical 100% reinforcement control seems like working in an entirely
different universe. In conditioning the control group would be pecking
and eating most of the time while the E-group would be standing around
waiting. The situation could be compared to a baseball player and violin- 
ist in that they both use their hands, but in obviously different capaci- 
ties. For this reason it was decided to use a fairly infrequent aperiodi- 
cally reinforced group as the control so that the two sets of data in ex- 
tinction could be plotted on the same axes. 

The results are summarized in Table 25 where it can be seen that 
there is little comparability between the E- and C-groups in either
conditioning or extinction. Conditioning responses for the Control birds
exceed those of the E-birds by a factor of 60 and in extinction by one of
nearly 10. There are two different ways of tackling these data. The hard 
way is to perform the ancova, testing for significance in conditioning and
in extinction separately and then teasing out the influence of the first
on the second by way of the correlation between the two sets of data. This
involves a lot of arithmetic. The easy way was the one followed, namely,
to percentagize each bird's extinction behavior over his conditioning be- 
havior. Actually, in this instance ratios were simply taken of per-hour 
performance in extinction over the same figure for conditioning. This 
procedure accomplishes at least two things: it cuts back on variability
and it takes into account any correlation extant between conditioning and
extinction. In a sense it is a simple form of ancova.

The results are dramatically reversed. In the ratio figures the
E-birds exceed the C ones by a factor of 10. It might be noted in passing
that the treatment "worked" in the sense that the birds for which
conditioning was made like extinction continued to respond and even after 36
hours of extinction were clipping along at about one-third of their
conditioning rate - and well above operant level.

Table 25

Percentagizing behavior: Conditioning and extinction responses
per hour for a C- and E-group where conditioning was made like
extinction for the E-group.

(Jenkins, 1955)

   CONDITIONING       EXTINCTION      EXTINCTION/CONDITIONING (%)

     C       E         C       E            C         E

   3500     23        154     10           4.4      43.5
   2700     29        115     12           4.3      41.4
   2200     37         83     17           3.8      45.9
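
The percentagizing step can be sketched in a few lines. The figures are the responses per hour from Table 25; the variable and function names are illustrative only and are not part of the original analysis.

```python
# Responses per hour from Table 25 (C = control birds, E = experimental birds).
conditioning = {"C": [3500, 2700, 2200], "E": [23, 29, 37]}
extinction = {"C": [154, 115, 83], "E": [10, 12, 17]}


def percent_ratios(ext, cond):
    """Each bird's extinction rate as a percentage of its own conditioning rate."""
    return [round(100.0 * e / c, 1) for e, c in zip(ext, cond)]


ratios = {g: percent_ratios(extinction[g], conditioning[g]) for g in ("C", "E")}
means = {g: sum(v) / len(v) for g, v in ratios.items()}

# How many times larger the E-group's mean percentage is than the C-group's.
reversal_factor = means["E"] / means["C"]

print(ratios)
print(round(reversal_factor, 1))
```

Dividing each bird's extinction rate by its own conditioning rate is what lets the two groups be compared on one scale despite the very large difference in raw response levels.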

Analysis of the data of Table 25 is quite straightforward. The 
Arrangement Technique yields a P-value of .05 for the two non-overlapping 
sets of three events in the ratio figures. If one wishes the arithmetical 
exercise, the classical t-value is greater than 25 and is quite clearly
highly significant for the four degrees of freedom involved. The Range
Test is definitely appropriate to these ratios and yields a highly
significant value exceeding 40.
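
As a rough check on these figures, the sketch below counts arrangements directly and computes an ordinary pooled-variance t. It assumes that the Arrangement Technique amounts to counting how many of the equally likely ways of splitting six ratios into two sets of three are at least as extreme as the split observed; that reading, and all names in the code, are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt

# Extinction/conditioning percentages from Table 25.
c_ratios = [4.4, 4.3, 3.8]
e_ratios = [43.5, 41.4, 45.9]
pooled = c_ratios + e_ratios

# Count the splits whose "E" set sums to at least the observed E total.
observed = sum(e_ratios)
splits = list(combinations(range(len(pooled)), len(e_ratios)))
as_extreme = sum(1 for idx in splits if sum(pooled[i] for i in idx) >= observed)
p_one_tailed = as_extreme / len(splits)


def mean(values):
    return sum(values) / len(values)


def sum_sq(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


# Classical two-sample t with pooled variance.
df = len(c_ratios) + len(e_ratios) - 2
pooled_var = (sum_sq(c_ratios) + sum_sq(e_ratios)) / df
std_err = sqrt(pooled_var * (1 / len(c_ratios) + 1 / len(e_ratios)))
t_value = (mean(e_ratios) - mean(c_ratios)) / std_err

print(len(splits), as_extreme, p_one_tailed)
print(df, round(t_value, 1))
```

With two non-overlapping sets of three there are 20 possible splits and only one as extreme as the one obtained, which is where a P-value of .05 comes from.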

In this instance there is serious question whether the classical 
ancova is applicable. A correlational term based on two sets of three 
cases has very little relational meaning. It is recommended that wherever
the data resemble those of Table 25, some form of percentagizing procedure
be employed both for simplicity's sake and for that of statistical sensi- 
tivity and minimal arithmetical error. 

Basic questions have previously been raised about complex designs
chiefly concerning the interpretation of the data of such items as
higher-order interactions. This same criticism and caution applies to the
ancova situation. For instance, significance may not emerge in the "before"
measures, but the behavior may head in the same direction here as in the after
measures so as to spuriously enhance significance in the "after" effects.
Again, correlations can be slippery things and the ancova case is no
exception. For example, suppose highly differential correlation exists
between the experimental and control groups across the "before" and "after"
measures. Should one combine them anyway disregarding the differential
or is it reasonable to convert to z-scores and combine? Suppose the cor- 
relation for the C-condition is positive and that for E negative and the 
difference in correlations is significant by usual standards? These and 
many other questions should make one think several times, not just before 
applying ancova, but more importantly before designing an experiment ap- 
propriate to the ancova procedure. 

As a case in point during World War II an investigator, quite logi- 
cally, tried out a new pilot selection instrument by testing neophytes 
and skilled pilots on it. The two distributions of measurements practical- 
ly did not overlap. When the two sets of scores were combined and the 
overall distribution correlated with an outside criterion, the resulting
correlation was high and positive although it was near zero for each
group separately. By combining two distributions apparently drawn from 
quite different parent populations, a markedly spurious correlation was 
generated, as witnessed by the low correlations produced by taking each 
group individually against the outside criterion. This situation illus- 
trates the kind of complication ancova can run into. More pertinent ex- 
amples will be presented later. 
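
The mechanism is easy to reproduce with made-up numbers. The scores below are hypothetical and are not the wartime data; they were chosen so that the test is unrelated to the criterion inside each group while the two groups sit far apart on both measures.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sp = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return sp / sqrt(ss_x * ss_y)


# Hypothetical selection-test scores and outside-criterion scores.
neophyte_test = [1, 2, 3, 4, 5]
neophyte_criterion = [2, 1, 3, 1, 2]
skilled_test = [11, 12, 13, 14, 15]
skilled_criterion = [12, 11, 13, 11, 12]

r_neophytes = pearson_r(neophyte_test, neophyte_criterion)
r_skilled = pearson_r(skilled_test, skilled_criterion)
r_pooled = pearson_r(neophyte_test + skilled_test,
                     neophyte_criterion + skilled_criterion)

print(round(r_neophytes, 2), round(r_skilled, 2), round(r_pooled, 2))
```

The pooled coefficient reflects nothing more than the gap between the two groups; within either group the test predicts nothing.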

To return to the ancova-type set-up, Table 17 is
another case in point. It exemplifies a repeated measurements arrangement
in which one group of birds was trained and tested under distributed
practice conditions and the other under massed. Ancova could be applied
in one of two ways to these data. First, only the behavior of the first
extinction session could be partialled out of that of the third extinction
session so that initial differences favoring the massed condition would
be corrected for in the reactions of the third extinction session also 
favoring the massed condition. (The percentagizing procedure from first 
to third sessions was suggested in that context and it will be recalled 
that differences indicating faster extinction for the distributed con- 
dition held up after conversion to percentages.) The other way in which 
ancova could be applied to the distribution of extinction experiment
constitutes the most complex analysis that can be applied to these data. It
involves partialling out extinction behavior in the first two sessions 
from that in the third session. This step becomes quickly tricky and 
sticky because it involves appreciably increased variability and differ- 
ential correlation not only across the distribution of extinction variable, 
but also from the 1-3 and 2-3 sessions correlations. Application of an- 
cova is likely in this instance to result in a mishmash, statistically 
speaking. Percentagizing seems far and away the most efficient and
statistically sensitive procedure for the numbers of Table 17. 

The overall point here is that any repeated measurement set-up where 
initial differences emerge can be considered an ancova arrangement. On 
a few rare occasions, traditional ancova may be necessary, but in most 
instances a percentage conversion procedure will do the job faster and 
more efficiently. It also follows that any classical transfer of train- 
ing design involving before and after-treatment measures can turn into an 
ancova set-up if the investigator gets a bad break and the groups do not 
turn out to be initially equivalent. Savings scores were designed to take 
care of this complication and do. 

To illustrate the complications of the ancova design, a hypothetical
(but not unrealistic) experiment was designed involving the application
of four different science curricula to second-grade public school pupils.
Bypassing the large difficulties involved in selecting and constructing
the curricula, training the teachers and various other pieces of
administrative spade work, it is assumed that a well-designed study was laid
out in which pre-experimental science knowledge was determined by a
pre-treatment assessment measure and an IQ index of intellectual ability was
employed. After exposure to the experimental curricula, a post-treatment
measure was applied to reflect changes in science knowledge, methodology,
attitude and philosophy. 

While it clearly would be far more reflective of the data to have 
individual scores on at least a sub-sample, we will settle for means and 
standard deviations as representative of trends in the hypothetical data. 
This information is summarized in Table 26 as it would be summarized in
a journal article. 

The data of Table 26 fall within very realistic limits for this type 
of experiment. It might be noted in passing that the hypothetical results
have not been complicated by differences that well might emerge such as
socio-economic status, number in family, residence and presence or absence
of parents. The first noteworthy item is the considerably greater loss
of Ss in the Gamma group as contrasted with the others. Since all groups
started with an N of 100 this attrition contributes an appreciable unknown
and unfortunate bias to the data. They presumably would be analysed any- 
way. 



Table 26

Hypothetical data on the influence of four different science curricula
on a science post-curriculum test in the second grade. A
pre-curriculum test and an IQ test were given initially.

                          SCIENCE CURRICULUM

                    Alpha     Beta     Gamma     Omega

Initial N            100      100       100       100
Final N               95       98        63       [illegible]

Pre-Test    Mean     50.3     49.8      60.7      54.2
            SD       12.5     15.9      20.5      17.8

IQ          Mean     98.4    101.3     108.7     103.6
            SD       14.8     18.7      13.4      15.3

Post-Test   Mean     51.7     56.9      73.4      75.7
            SD       10.4     14.6      18.9      22.3

Turning to the main findings, the post-treatment test performance,
there seems to be an appreciable difference in the extremes (Alpha vs.
Omega) of the rough order of one and one-half standard deviation units.
This in itself should be significant, but there are a number of other
considerations before any conclusions can be drawn. The high wastage rate
in Gamma cannot be ignored, but the alternative is to throw the whole
group out and this seems frightfully wasteful, particularly since
experiments of this kind take at least a year to plan and another year to
conduct.

Of considerable import are the relatively large differences apparent 
in the mean science pre-treatment scores. The maximal difference amounts 
to about half a standard deviation, a magnitude not to be ignored. More 
basically, Gamma and Omega, which have the highest pre-treatment test 
scores, also have the highest post-treatment scores. A correction must 
be introduced for this event since post-treatment differences may reflect
in large part differences in initial knowledge of the pupils regarding 
science. Furthermore, mean IQ's show the same trend, that is, the groups 
which have the higher post-treatment test scores also have the higher
initial IQ averages. Again some correction must be introduced or the final
differences may simply reflect greater intellectual ability alone or com- 
bined with greater science information. 

The data are incomplete for an analysis of covariance as they stand. 
The correlations (or at least the S-by-S data or the cross products) are
needed between pretest and IQ on the one hand and post-treatment performance
on the other. Given incomplete information, let us see what can be retrieved.

While the post-test scores differ appreciably across groups, it seems
likely that they are contaminated by pre-test and IQ differences.
Considering two sets of differences simultaneously - across groups or
conditions and across tests or measures - it is obvious that, although the
post-test differences are of the order of 1.5 sigma units, the pre-test
and IQ differences are of the order of around a half a sigma. Furthermore,
it is quite reasonable to assume a fairly substantial positive correlation
between scores on the two initial devices and the post-treatment
index. Given this information and these assumptions, it seems quite
plausible to expect that the post-treatment differences will be highly
diluted when pre-treatment test scores and IQ are partialled out. From a
practical standpoint, one must - in the absence of other considerations -
recommend treatment Omega for use, since the greatest science test gains
accrued to it, the sample stayed fairly intact and the average IQ is not
too far out of line with the lower groups. Gamma has the disadvantages
of high attrition for unknown reasons, the highest mean IQ and an
appreciably lesser gain on the science measure.
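
The arithmetic behind these statements can be laid out from the summary figures of Table 26 alone. Two assumptions are made in this sketch: the damaged pre-test means for Beta and Omega are read as 49.8 and 54.2, and the average of the four group SDs is used as a crude sigma yardstick. Neither the yardstick nor the names in the code come from the original.

```python
# Means and SDs from Table 26. The pre-test means for Beta (49.8) and
# Omega (54.2) are readings of damaged entries and are assumptions.
curricula = ["Alpha", "Beta", "Gamma", "Omega"]
pre_mean = [50.3, 49.8, 60.7, 54.2]
pre_sd = [12.5, 15.9, 20.5, 17.8]
post_mean = [51.7, 56.9, 73.4, 75.7]
post_sd = [10.4, 14.6, 18.9, 22.3]

# Mean gain from pre-test to post-test for each curriculum.
gain = {c: round(post - pre, 1)
        for c, pre, post in zip(curricula, pre_mean, post_mean)}

# Largest between-group difference in units of the average group SD
# (the averaged SD is an assumed yardstick, not the author's).
post_spread = (max(post_mean) - min(post_mean)) / (sum(post_sd) / len(post_sd))
pre_spread = (max(pre_mean) - min(pre_mean)) / (sum(pre_sd) / len(pre_sd))

best_gain = max(gain, key=gain.get)

print(gain)
print(round(post_spread, 2), round(pre_spread, 2), best_gain)
```

On these readings the post-test extremes differ by a little under one and a half sigma, the pre-test extremes by rather more than half a sigma, and Omega shows the largest mean gain, which is consistent with the practical recommendation above.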

Replication is clearly called for and the refinements are obvious. 
Smaller Ns should be employed with groups matched on pre-treatment test
scores and IQ (if these relate to post-treatment performance on an
appreciable scale), and some consideration should be given to socio-economic
features and associated items along with factors contributing to attrition.
If Alpha and Beta correlate substantially, one should be dropped and the one
retained that takes the lesser time and effort to teach the teachers and
to administer. With this matched group design, direct comparison could
be made of post-treatment test performance.

It is time to pull together the essential steps in analysis of co- 
variance. First, it should be noted that the ancova procedure adjusts 
the effects of the experimental treatment, that is, the post-treatment 
measure of behavior, according to behavior on the pre-test or pre-treatment
measures. The extent of adjustment depends on three items: 1) the
extent of correlation between the pre- and post-treatment measures; 2) the
size of the difference in the groups on the pre-treatment measure; and
3) the magnitude of the difference in behavior on the post-treatment index.

The actual steps in ancova, while somewhat cumbersome and lengthy 
arithmetically, are quite straightforward. First, analysis of variance is
accomplished on the post-treatment measures; then on the pre-treatment ones.
Next anova is applied to the cross-products of the two measures. Finally,
in the ancova analysis, the post-treatment differences (on which the
experimental focus falls) are adjusted or corrected for the pre-treatment
differences and the correlation between pre and post measures. 
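
These steps can be sketched for the simplest case of two groups and one pre-treatment measure. The scores are hypothetical and the names are illustrative. The sketch uses the usual textbook computation for one covariate (pooled within-group slope, adjusted means, adjusted sums of squares), which is assumed here to be what the verbal description refers to.

```python
# Hypothetical pre-test (x) and post-test (y) scores; not data from the paper.
groups = {
    "control": {"x": [10, 12, 14, 16], "y": [12, 13, 16, 17]},
    "experimental": {"x": [14, 16, 18, 20], "y": [17, 18, 21, 22]},
}


def mean(values):
    return sum(values) / len(values)


def cross_products(a, b):
    """Sum of products of deviations; with a == b this is a sum of squares."""
    ma, mb = mean(a), mean(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b))


# Steps 1-3: within-group sums for post (y), pre (x) and their cross-products.
ss_y_within = sum(cross_products(g["y"], g["y"]) for g in groups.values())
ss_x_within = sum(cross_products(g["x"], g["x"]) for g in groups.values())
sp_within = sum(cross_products(g["x"], g["y"]) for g in groups.values())

# The same three quantities for all scores taken together.
all_x = [v for g in groups.values() for v in g["x"]]
all_y = [v for g in groups.values() for v in g["y"]]
ss_y_total = cross_products(all_y, all_y)
ss_x_total = cross_products(all_x, all_x)
sp_total = cross_products(all_x, all_y)

# Step 4: adjust the post-test means and sums of squares for the pre-test.
slope_within = sp_within / ss_x_within
grand_x = mean(all_x)
adjusted_means = {
    name: mean(g["y"]) - slope_within * (mean(g["x"]) - grand_x)
    for name, g in groups.items()
}
adj_ss_within = ss_y_within - sp_within ** 2 / ss_x_within
adj_ss_total = ss_y_total - sp_total ** 2 / ss_x_total
adj_ss_between = adj_ss_total - adj_ss_within

# One degree of freedom is lost from the within term for the covariate.
df_between = len(groups) - 1
df_within = len(all_x) - len(groups) - 1
f_ratio = (adj_ss_between / df_between) / (adj_ss_within / df_within)

raw_difference = mean(groups["experimental"]["y"]) - mean(groups["control"]["y"])
adjusted_difference = adjusted_means["experimental"] - adjusted_means["control"]

print(raw_difference, round(adjusted_difference, 2), round(f_ratio, 2))
```

In this made-up case the raw post-test difference of 5 points shrinks to 1.4 once the pre-test difference is allowed for, which shows how strongly the three items listed above can dilute an apparent treatment effect.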

A couple of side comments are called for. Ancova assumes linear re- 
gression, i.e., a linear relationship between the variables involved. This 
is sometimes the case, sometimes not. When non-linear regression prevails, 
real problems are posed for ancova. A second point concerns the use of
more than one supplementary or pre-treatment measure as in the educational 
experiment depicted in Table 26. Matching or selection on more than one 
pre-treatment variable rapidly reaches a point of diminishing experimental 
returns. As a matter of fact it is rare that more than one variable will 
appreciably contribute to the picture and the use of more than one many 
times complicates the situation not only arithmetically, but compounds
and confounds the complications of relationships among the several meas- 
ures. More than one matching or selection variable is not recommended un- 
less the very special case arises of two relatively uncorrelated indices 
both of which correlate with the criterion or post-treatment measure. If 
more than one variable is necessary, matching by pairs or by groups be- 
forehand is far more efficient than ancova. 

At this juncture we are ready for a real-life experimental example 
that demonstrates the complications, difficulties and short-comings of
ancova as well as pointing up matters of design in the ancova setting. W.E.
Morris (1953) experimentally examined the problem of teaching the analysis
of language or communication of written messages by way of Charles Morris'
types of discourse. He employed the traditional transfer of training
model with a pre-test, treatment and post-test. He first pre-tested
university graduate and undergraduate students on comprehension of written
discourse. The experimental groups were then given training on Charles
Morris' types of discourse while the control groups were exposed to the
materials of Feigl and Osgood. (It should be noted that it was considered
unnecessary to run the "pure" control groups (no treatment whatsoever)
and thus the cards ab initio were stacked for reduced differences across
groups.) A post-treatment test of "language comprehension" was then
administered with time devoted to the post-test recorded. Post facto no
systematic differences emerged among the various control conditions and
they were lumped to simplify the analysis. Three experiments were
conducted; the first involved 10 Morris Ss and 18 controls ("Others"); the
second 6 and 17; and the third 9 and 9.

In this ancova design there are two predictor, selector, matching 
or co-variables: pre-test or pre-treatment performance and time on
post-test. (It is obvious that all other things equal an S devoting two
minutes to the complex post-treatment task is not likely to do as well as
one giving it 30 minutes. The experimenter found it took him 30 min. to 
complete the post-test.) The criterion is, of course, post-treatment test 
performance. The basic experimental issue is whether or not the Morris 
treatment contributed more to the understanding of language than did the 
other procedures. 

There would be no problem in treating the data from these experiments
if pre-treatment performance and time on the post-test were equivalent
across the Experimental and Control groups. In this case a direct
and simple statistical comparison could be made on post-test performance.
But they were not comparable, and recourse had to be made to ancova. (A
couple of relatively minor complications will be ignored such as loss of
an S or two who do not comprehend the instructions and some heterogeneity
of variance.)

Table 27 summarizes the results from the three separate experiments.
An overview of this vast mass of numbers suggests little systematic trend
in the data. (The clear superiority of Ss in Experiment III can be ignored
for the present purposes since it is attributable to refinements in
technique and the use of graduate students.) More thorough examination of the
data suggests higher post-treatment performance in the Morris group
although the effect is by no means glaring. The statistical problem is to
pin this apparent difference down in the light of other covariations. Or at
least the data must be squeezed dry to see what, if anything, happened to the
behavior of the Ss differentially treated by the different language
systems.

Table 27

An empirical example of ancova: W.E. Morris experiments on Charles
Morris' types of discourse (1953)

                            MORRIS         FEIGL, OSGOOD AND
                           TREATMENT        OTHER TREATMENT

EXPERIMENT I
   N                          10                  18
   Mean Post-Test            19.7                16.8
   Mean Pre-Test             15.3                15.2
   Mean Time                 15.3                13.9

EXPERIMENT II
   N                           6                  17
   Mean Post-Test            19.0                12.1
   Mean Pre-Test             17.0                10.5
   Mean Time                 13.5                13.5

EXPERIMENT III
   N                           9                   9
   Mean Post-Test            31.6                31.9
   Mean Pre-Test             29.6                28.4
   Mean Time                 18.4                11.2

The first complication arises because the differences in post-test 
time, though small, are apparently real in a statistical sense. Thus 
they must be taken into account in the final analysis. 

A second and somewhat more basic difficulty arises when the correlations
among the several measures are examined. W.E. Morris correlated pre-test
performance and post-test time with the criterion of post-treatment
performance. While the Ns are relatively small, the trends are clear. As 
a sample, the intercorrelations for Experiment I follow. 

                    MORRIS                     OTHERS

              Pre-test    Time           Pre-test    Time

Post-test       .68       .08              .13       -.08

Pre-test         --      -.05               --       -.53

A clearer set of differential correlations is hard to come by. In 
the Morris group pre- and post-test correlated high positive; in Others 
it was near zero. The pre-test-time correlation in the Morris condition 
was near zero while it was substantially negative in the control case. 
Serious doubts are immediately raised about the legitimacy of combining 
the two sets of correlations for analysis purposes. 
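
One conventional way to put a number on such a contrast is Fisher's z transformation for two independent correlations. It is not a procedure named in the original; it is offered here only as a sketch, using the Experiment I coefficients above and the group sizes of 10 and 18.

```python
from math import atanh, sqrt


def fisher_z_difference(r1, n1, r2, n2):
    """Normal deviate for the difference between two independent correlations."""
    standard_error = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (atanh(r1) - atanh(r2)) / standard_error


# Experiment I: Morris group (N = 10) versus Others (N = 18).
z_pre_post = fisher_z_difference(0.68, 10, 0.13, 18)
z_pre_time = fisher_z_difference(-0.05, 10, -0.53, 18)

print(round(z_pre_post, 2), round(z_pre_time, 2))
```

With Ns this small the deviates come out near 1.5 and 1.2, modest values for coefficients that look so different. That is one more reason to treat any pooled correlational term built on such samples with caution.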

For each experiment, W.E. Morris separately computed the multiple 
correlation between the predictors and criterion, that is between
pre-treatment test performance and time on the one hand and post-treatment
test scores on the other. The multiples were calculated, of course,
separately for the Morris and Other conditions. These multiple correlations
follow: 

                   MORRIS        OTHER
EXPERIMENT        TREATMENT    TREATMENT

    I                .63          .13

    II               .84          .10

    III              .73          .385

Again, the contrast in these relationships between the two conditions
is striking. The Morris multiples are quite high; the Control figures
relatively low. The finding that these two sets of data are clearly drawn
from different parent populations dictates the use of some statistical
technique other than ancova. But as in many complex experiments, the
investigator wishes to test out his data from every angle so ancova was
applied by way of correcting or adjusting post-treatment test scores for
both pre-treatment performance and time devoted to the post-test. None
of the overall outcomes exceeded the 20% level of significance.

At this point investigators are faced with several possibilities. 
The obvious solution is to design and conduct an improved experiment.
(One ridiculous possibility is to forget the whole thing - a hasty mistake
that might be made if follow-up analyses were not conducted.) There are,
however, other considerations that dictate communication of the state of
the art to the public or, at least to that portion of it known as the
graduate school. To exhaustively analyze the data after application of these
complex procedures, Morris selected out two sets of eight Ss each who
matched up on pre-treatment test performance and on time employed in the
post-treatment test. One set of eight was from the Morris condition and 
the other from the Control. The two samples turned out to be nearly per- 
fectly matched on these two dimensions and a difference of nearly two to 
one emerged in post-treatment test performance: 26.8 versus 13.8. The 
two distributions of test scores overlapped very little and the difference 
by a t-test was significant at about the 1% level. One might read this 
as salvaging data - 16 Ss retrieved from an initial total of 69 - but it 
appears to be far more than that. It demonstrates that in the complex 
area of language interpretation where few behavioral principles are known, 
one system of analyzing communication is superior to others. At the very 
least it sets the stage for a program of research in this area. 

W.E. Morris introduced one "experimental" gimmick that did not appear
in the dissertation. After wading through the seemingly innumerable and 
complex analyses, he went back and culled out Ss he knew personally. These 
he sorted into two piles: those he judged to "like" him and those he jud- 
ged not to. Comparing the performance of these two groups yielded some 
intriguing data. Not only did the "not-likes" take appreciably less time 
in performing on the post-test, but they scored at a lower level across the
board. Following up on this finding (?) an eye-ball examination of the
data indicated that almost all experimental Ss who spent 5 min. or more
per item on the post-test performed significantly above chance while those 
spending less than 5 min. did not. On the basis of this reasonable if in- 
triguing finding, an additional experiment was carried out. Using the two 
most reliable test items of six (decreasing the time required for testing), 
Morris cajoled (?) the students of an undergraduate psychology class into
spending a reasonable amount of time at the task. The results of this 
experiment clearly favored the Morris (experimental) group. 

Attribute those findings to a "personality factor" or whatever, there
seems to be a basic dimension here for experimental examination - and not
just with human Ss. Rats do not seem to "like" certain investigators as
witness increased frequency of biting, struggling, escaping from E, and
very likely special behaviors in the experimental setting. These special
behavioral features probably are more rampant with more complex species
such as chimpanzees and college students.

AFTERTHOUGHTS, ODDS AND ENDS AND SOME OVERVIEW MATTERS IN
EXPERIMENTAL DESIGN, METHODOLOGY AND STATISTICS

It is not always easy to buck the tide. Many writers today claim 
that improvements and refinements in statistical procedures put
investigators in a position to lay out more elaborately designed experiments that
yield vastly increased information. This is a moot point with which the 
writer disagrees wholeheartedly. The growth of a science is reflected 
at first in its increasing complexity of methodology and theory. Where 
one doesn't know much, one speculates widely and tries out innumerable 
things. As science progresses, relative simplicity sets in. The word
"relative" is used advisedly. Einstein's basic formulation was superfi- 
cially simple, but enormously complicated in its implications and ramifi- 
cations. This "principle of relative simplicity" holds for both design 
and analytical procedures. Physical scientists are dealing with whopper 
effects; they don't need statistical manipulations to tease the phenomena
out. Their order of magnitude is 10 to 1 or 100 to 1 or a million to one. 

This point clearly applies to the behavioral sciences. Right now 
there's a lot we don't know - although it would seem that we know a good 
deal more than we sometimes profess. We do have basic principles for 
manipulating and changing behavior on a large scale - although we don't 
always use them, particularly in the practices of child training and
education. Much of our research, viz. psychotherapy, is a fishing
expedition. We're trying to find variables that change individual behavior and
measures of behavior that reflect our experimental applications. At the
same time we're trying to construct theoretical structures that will handle
the rapidly accumulating data. 

There are two ways - at different ends of the continuum but not
basically opposed - for tackling these problems. One is the overview approach
in which any variable remotely related to behavioral change is thrown in- 
to the experimental pot along with variables that have theoretical or em- 
pirical foundation. This approach clearly involves complex design and 
elaborate statistics for partialling out the effects and inter-effects of 
the several treatments. The alternative is the classical method of
univariable experimentation where one treatment is applied at a time. This
procedure is traditional in the fields of sensation, perception, verbal 
learning a la McGeoch, comparative psychology and a few other areas. (The 
"pure" psychophysicist trains and uses one S and sometimes one other for 
replication.) The multi-variable approach seems more characteristic of
some of the newer disciplines such as clinical psychology and the social 
and educational fields. The nature of the problem at hand and the state
of knowledge behind it will, of course, play a considerable role in de- 
termining which approach will be employed. The current view is that when 
we don't know much, we should play it simple and where we know a good deal 
we are in a position to do the same. After all very few (if any) of the 
great discoveries of science came about by way of multi-variable
experimentation. It can be argued, of course, that this is an historical and
cultural artifact. There is no answer to this; history can't be experi- 
mented on. 

There are many pressing major problem areas in behavioral science 
among which may be mentioned "mental illness" and psychotherapy, child
rearing and educational practices, that nebulous entity area known as
"motivation" and aptitude assessment. Research on a small portion of any
of these could fill the experimental lifetime of most researchers.

The terminal and main point of this paper concerns the role of
statistics in these and all other research areas in the behavioral sciences.
Statistics and experimental design are not synonymous. They are radically
different matters. Experimental design is paramount, propaedeutic and
foremost; statistics are a second order of business and are not essential
to design. Occasionally they help, but, unfortunately they sometimes
hinder, mislead and distort behavioral variations. The reader is herewith
implored to concentrate his efforts on design in behavioral science and 
leave statistics to the statisticians. 